Pinecone Algorithms Stack Up Across the BigANN Tracks: Outperforming the Winners by up to 2x

The Billion-Scale Approximate Nearest Neighbor Search Challenge (BigANN), part of the NeurIPS competition track, aims to advance research in large-scale approximate nearest neighbor search (ANNS).

BigANN is a collaborative arena where the best minds in the field come together to push the boundaries of vector search technology. Participants face four distinct tracks, each tackling a different aspect of the challenge:

  • Filter Track: This track focuses on efficiently finding nearest neighbors while filtering results based on specific tags or metadata.
  • Sparse Track: It addresses nearest-neighbor search over sparse vectors, that is, high-dimensional vectors in which most coordinates are zero.
  • Streaming Track: This track tests the ability of algorithms to adapt quickly to new data being added or removed in real time.
  • Out-of-Distribution (OOD) Track: It evaluates performance in cross-modal search scenarios where queries come from a distribution different from the indexed vectors.

Pinecone, a vector database company, helped co-organize this competition. It also invested in developing new algorithms and optimizing existing techniques to compete with other teams. Pinecone’s methods performed strongly in all four tracks, achieving up to twice the performance of the next best entry.

Pinecone’s Algorithms

1. Filter Track Algorithm

In the Filter track, Pinecone’s algorithm uses a classic IVF (inverted file) setup. It partitions the data into clusters and builds an inverted index for each metadata tag, listing the vectors that carry that tag. When a query arrives, the algorithm first evaluates its selectivity, that is, the number of vectors that pass the filter.

It then scans a varying number of clusters depending on that selectivity, using the tag lists to score only the vectors that pass the filter. Precomputing certain details and using AVX instructions for distance calculations keep the scan fast. Hyperparameters are tuned by solving a constrained convex optimization problem on the public query set.
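The selectivity-driven search described above can be illustrated with a minimal sketch. It is not Pinecone’s implementation: the class name `FilteredIVF`, the `brute_force_below` threshold, and the fixed `nprobe` are hypothetical choices made for illustration, and plain Python loops stand in for the AVX kernels.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class FilteredIVF:
    """Toy IVF index with one inverted list per tag (illustrative only)."""

    def __init__(self, vectors, tags, centroids):
        self.vectors = vectors
        self.centroids = centroids
        self.tag_index = defaultdict(set)   # tag -> ids of vectors carrying it
        self.clusters = defaultdict(list)   # cluster id -> ids assigned to it
        for i, (vec, vec_tags) in enumerate(zip(vectors, tags)):
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(vec, centroids[c]))
            self.clusters[nearest].append(i)
            for t in vec_tags:
                self.tag_index[t].add(i)

    def search(self, query, query_tags, k, brute_force_below=2, nprobe=1):
        # Selectivity: how many vectors carry every tag in the query.
        tag_sets = [self.tag_index.get(t, set()) for t in query_tags]
        if tag_sets:
            allowed = set.intersection(*tag_sets)
        else:
            allowed = set(range(len(self.vectors)))

        if len(allowed) <= brute_force_below:
            # Highly selective filter: score the few survivors directly.
            candidates = list(allowed)
        else:
            # Broad filter: probe the nearest clusters, keep only allowed ids.
            order = sorted(range(len(self.centroids)),
                           key=lambda c: sq_dist(query, self.centroids[c]))
            candidates = [i for c in order[:nprobe]
                          for i in self.clusters[c] if i in allowed]
        return sorted(candidates,
                      key=lambda i: (sq_dist(query, self.vectors[i]), i))[:k]


vectors = [[0, 0], [0, 1], [10, 10], [10, 11], [0, 2]]
tags = [{"red"}, {"red", "big"}, {"red"}, {"big"}, {"red"}]
index = FilteredIVF(vectors, tags, centroids=[[0, 1], [10, 10]])

narrow = index.search([0, 0], ["red", "big"], k=1)  # one vector passes the filter
broad = index.search([0, 0], ["red"], k=2)          # four vectors pass the filter
```

In this toy data the two-tag query takes the brute-force path because only one vector passes, while the single-tag query probes the nearest cluster. A real system would tune the threshold and the number of probed clusters rather than hard-code them.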

Result: Pinecone reports this track’s benchmark results in its blog post at https://www.pinecone.io/blog/pinecone-algorithms-set-new-records-for-bigann/

2. Sparse Track Algorithm 

In the Sparse track, Pinecone’s algorithm clusters the sparse vectors and builds an inverted index with a unique structure. At query time it first retrieves the most relevant clusters, then finds the top-k vectors within those clusters using an anytime retrieval algorithm over the inverted index.

Additional lightweight components include a k-MIP graph, in which each vector is linked to the k other vectors with which it has the largest inner product, for expanding the set of retrieved top-k vectors, and a compressed forward index for re-ranking. This results in a hybrid solution combining IVF and graph-based methods.
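A rough sketch of how these pieces can fit together is shown below. It makes several simplifying assumptions that are not from the source: clustering is skipped, the "anytime" behavior is modeled as a hypothetical `budget` on how many posting lists are read, and an uncompressed list of vectors plays the role of the forward index. All function names are illustrative.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two sparse vectors stored as {dimension: value}."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(d, 0.0) for d, v in a.items())


def build_inverted_index(docs):
    """Map each dimension to the (doc id, value) pairs that are non-zero there."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for d, v in vec.items():
            index[d].append((doc_id, v))
    return index


def build_kmip_graph(docs, k):
    """Link each vector to the k others with the largest inner product (brute force)."""
    graph = []
    for i, vec in enumerate(docs):
        others = [j for j in range(len(docs)) if j != i]
        others.sort(key=lambda j: dot(vec, docs[j]), reverse=True)
        graph.append(others[:k])
    return graph


def search(query, docs, index, graph, k, budget):
    # Anytime-style scoring: read posting lists for the heaviest query
    # dimensions first and stop once the budget is spent.
    partial = defaultdict(float)
    dims = sorted(query, key=lambda d: abs(query[d]), reverse=True)
    for d in dims[:budget]:
        for doc_id, v in index.get(d, []):
            partial[doc_id] += query[d] * v
    seeds = sorted(partial, key=partial.get, reverse=True)[:k]

    # Expand the seed set with each seed's k-MIP neighbors.
    candidates = set(seeds)
    for s in seeds:
        candidates.update(graph[s])

    # Re-rank with exact scores from the forward index (here: docs itself).
    return sorted(candidates, key=lambda i: (-dot(query, docs[i]), i))[:k]


docs = [{0: 1.0, 1: 0.5}, {0: 0.9, 2: 2.0}, {2: 1.0}, {3: 1.0}]
index = build_inverted_index(docs)
graph = build_kmip_graph(docs, k=1)
query = {0: 1.0, 2: 0.8}

top1 = search(query, docs, index, graph, k=1, budget=1)
top2 = search(query, docs, index, graph, k=2, budget=1)
```

With a budget of one posting list, partial scoring ranks document 0 first, but graph expansion pulls in document 1, which the exact re-ranking step then places on top. That is the role the graph and forward index play in the hybrid design.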

Result: Pinecone reports this track’s benchmark results in its blog post at https://www.pinecone.io/blog/pinecone-algorithms-set-new-records-for-bigann/

3. OOD Track Algorithm 

The OOD track algorithm shares similarities with the Sparse track approach. It involves three main components: an inverted-file (IVF) index, a k-MIP graph constructed using the co-occurrence of vectors, and quantization for SIMD-based acceleration.

Retrieval occurs in three stages: scoring quantized vectors to retrieve candidates from top clusters, expanding the candidate set using the k-MIP graph, and scoring all candidates based on fine-grained quantized representations. Batch processing of queries is employed for accelerated search.
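The three stages can be sketched as follows. This is an illustration under stated assumptions rather than the competition code: the uniform scalar quantizer, the choice of 3 and 255 levels for the coarse and fine codes, the hand-written cluster assignment, and the precomputed `graph` are all hypothetical, and queries are processed one at a time instead of in batches.

```python
def dot(a, b):
    """Inner product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def quantize(vec, levels):
    """Round each coordinate (clipped to [-1, 1]) to one of `levels` evenly spaced values."""
    step = 2.0 / (levels - 1)
    return [round((min(1.0, max(-1.0, x)) + 1.0) / step) * step - 1.0 for x in vec]


def three_stage_search(query, centroids, clusters, coarse, fine, graph, k, nprobe):
    # Stage 1: pick the top clusters, then score their members on coarse codes.
    top = sorted(range(len(centroids)), key=lambda c: -dot(query, centroids[c]))[:nprobe]
    ids = [i for c in top for i in clusters[c]]
    seeds = sorted(ids, key=lambda i: -dot(query, coarse[i]))[:k]

    # Stage 2: expand the candidate set through the k-MIP graph.
    candidates = set(seeds)
    for s in seeds:
        candidates.update(graph[s])

    # Stage 3: score every candidate on the fine-grained codes.
    return sorted(candidates, key=lambda i: (-dot(query, fine[i]), i))[:k]


vectors = [[0.9, 0.1], [0.8, 0.3], [-0.9, 0.2], [0.7, 0.7]]
centroids = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
clusters = [[0, 1], [2], [3]]                 # assumed cluster membership
graph = [[1], [3], [3], [1]]                  # assumed precomputed 1-MIP graph
coarse = [quantize(v, 3) for v in vectors]    # values in {-1, 0, 1}
fine = [quantize(v, 255) for v in vectors]    # close to the original values

result = three_stage_search([0.6, 0.8], centroids, clusters, coarse, fine,
                            graph, k=2, nprobe=1)
```

Probing a single cluster yields only vector 3 here; the graph step adds vector 1, so the search can still return two results after fine-grained scoring.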

Result: Pinecone reports this track’s benchmark results in its blog post at https://www.pinecone.io/blog/pinecone-algorithms-set-new-records-for-bigann/

4. Streaming Track Algorithm 

Pinecone’s solution for the Streaming track adopts a two-stage retrieval strategy. Initially, a variant of the DiskANN index is used for candidate generation, producing a set of k’ >> k results through an approximate scoring mechanism over uint8-quantized vectors with SIMD-based distance calculation. 

The second stage re-ranks those candidates with full-precision scoring to improve retrieval accuracy. Notably, the raw vectors used in this stage are stored on SSD, which makes it important to minimize the disk reads that re-ranking triggers.
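The two-stage idea can be illustrated with the sketch below. It deliberately swaps the DiskANN-style graph for a brute-force scan over the quantized codes, and it simulates the SSD with an in-memory list plus a read counter; the class `TwoStageIndex`, its methods, and the `[lo, hi]` value range are hypothetical.

```python
def quantize_uint8(vec, lo, hi):
    """Map each coordinate from [lo, hi] to an integer code in 0..255."""
    scale = 255.0 / (hi - lo)
    return [max(0, min(255, round((x - lo) * scale))) for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TwoStageIndex:
    """Approximate scoring on uint8 codes, then full-precision re-ranking."""

    def __init__(self, vectors, lo, hi):
        self.lo, self.hi = lo, hi
        self.codes = [quantize_uint8(v, lo, hi) for v in vectors]  # kept in memory
        self._disk = list(vectors)   # stand-in for raw vectors stored on SSD
        self.disk_reads = 0

    def _read_full(self, i):
        # Every full-precision fetch counts as one simulated disk read.
        self.disk_reads += 1
        return self._disk[i]

    def search(self, query, k, k_prime):
        # Stage 1: rank all codes approximately and keep k' > k candidates.
        # (A brute-force scan replaces the graph traversal of a real index.)
        q_code = quantize_uint8(query, self.lo, self.hi)
        approx = sorted(range(len(self.codes)),
                        key=lambda i: sq_dist(q_code, self.codes[i]))[:k_prime]
        # Stage 2: re-rank only those candidates with full-precision vectors.
        return sorted(approx, key=lambda i: sq_dist(query, self._read_full(i)))[:k]


index = TwoStageIndex([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.52, 0.5]], lo=0.0, hi=1.0)
result = index.search([0.6, 0.5], k=2, k_prime=3)
reads = index.disk_reads
```

The read counter makes the trade-off visible: a larger k' gives re-ranking more chances to recover true neighbors, but each extra candidate costs one more disk read.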

Result: Pinecone reports this track’s benchmark results in its blog post at https://www.pinecone.io/blog/pinecone-algorithms-set-new-records-for-bigann/

Conclusion

The BigANN challenge emphasized integrating new features into vector databases, with relevance to both academic and industrial applications. Essential aspects include cost efficiency, data freshness, and support for advanced capabilities such as filtered queries. Pinecone is incorporating these insights and new algorithms into its vector index.

Manya Goyal is an AI and Research consulting intern at MarktechPost. She is currently pursuing her B.Tech at Guru Gobind Singh Indraprastha University (Bhagwan Parshuram Institute of Technology). She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is a podcaster on Spotify and is passionate about exploring.