OpenAI Releases Three Embedding Model Families To Optimize Text Search, Code Search and Text Similarity

In the last few decades, neural networks have been used for a wide range of tasks, including image segmentation, natural language processing, and time-series forecasting. 

One promising use of deep neural networks is embedding, a method for representing discrete variables as continuous vectors. An embedding maps high-dimensional data into a lower-dimensional space in which distances between vectors reflect relationships between the underlying items: embeddings that are numerically close are semantically similar. Word embeddings for machine translation and entity embeddings for categorical data are two applications of this approach.

The OpenAI API’s new /embeddings endpoint allows users to embed text and code with just a few lines of code. OpenAI has recently released three embedding model families, each optimized to perform well on a different task: text similarity, text search, and code search. These make it easy to perform tasks like semantic search, clustering, topic modeling, and classification over natural language and code.
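As a rough sketch of what a request to the /embeddings endpoint looks like: the model name below is one from the announcement, and the exact request fields may differ from the current API, so treat this as illustrative rather than authoritative.

```python
import json
import urllib.request

# Build the JSON payload for the /embeddings endpoint.
# "text-similarity-babbage-001" is one of the announced model names;
# substitute whichever embedding model you intend to use.
payload = {
    "input": "The food was delicious and the waiter was friendly.",
    "model": "text-similarity-babbage-001",
}
body = json.dumps(payload).encode("utf-8")

def embed(api_key: str) -> list:
    """Send the request; requires a real API key, so not executed here."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]
```

The response contains the embedding as a list of floats, one per dimension of the model's vector space.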

The new endpoint uses neural network models to “embed” text and code as points in a high-dimensional vector space, where each dimension captures some aspect of the input. The results reveal that the embeddings outperform the previous best models on three standard benchmarks, including a 20% gain in code search.


Text similarity models

Text similarity models generate embeddings that capture the semantic similarity of pieces of text. The similarity of two pieces of text can be measured by simply taking the dot product of their embeddings; for embeddings normalized to unit length, this is the “cosine similarity,” a score ranging from –1 to 1, with higher values indicating greater similarity. In most cases, the embeddings can be computed ahead of time, making the dot-product comparison exceedingly fast.
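A minimal illustration of the similarity score with NumPy, using made-up vectors in place of real model embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the vectors' lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for text embeddings.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a: score near 1
c = np.array([-1.0, -2.0, -3.0])  # opposite direction: score near -1

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

For unit-length embeddings the denominators are 1, and the score reduces to a plain dot product, which is why precomputed embeddings make comparison so cheap.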

One common application of embeddings is as features in machine learning tasks such as classification. When the classifier is linear, this setup is referred to as a “linear probe.” On SentEval, a widely used benchmark for measuring embedding quality, the text-similarity models attain new state-of-the-art linear probe classification results.
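A linear probe simply trains a linear classifier on top of frozen embeddings. The sketch below uses toy 2-D “embeddings” and plain logistic regression by gradient descent; in practice one would use a library classifier and real model embeddings.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit logistic regression on frozen embeddings X with labels y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy "embeddings": two well-separated clusters, one per class.
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
              [ 1.0,  2.0], [ 2.0,  1.0], [ 1.5,  1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train_linear_probe(X, y)
preds = ((X @ w + b) > 0).astype(int)
accuracy = (preds == y).mean()
```

The probe's accuracy is a proxy for how much linearly accessible information the embeddings carry about the labels.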

Text search models

Given a text query, text search models produce embeddings that enable large-scale search tasks, such as locating a relevant document among a large collection of documents. The embeddings of the query and of each document are computed separately, and their similarity is measured with cosine similarity.
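The retrieval step can be sketched as follows, with made-up vectors standing in for the query and document embeddings that the search models would return:

```python
import numpy as np

def rank_documents(query_vec, doc_vecs):
    """Rank document ids by cosine similarity to the query embedding."""
    scores = {}
    for doc_id, vec in doc_vecs.items():
        scores[doc_id] = float(
            np.dot(query_vec, vec)
            / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        )
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical precomputed document embeddings (real ones come from the API).
docs = {
    "refund-policy": np.array([0.9, 0.1]),
    "shipping-times": np.array([0.1, 0.9]),
}
query = np.array([1.0, 0.0])  # e.g. embedding of "how do I get my money back"

ranking = rank_documents(query, docs)
```

Because document embeddings are computed once and stored, only the query needs to be embedded at search time.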

Because it captures the semantic meaning of the text and is less sensitive to exact phrases or words, embedding-based search generalizes better than the word-overlap techniques used in traditional keyword search. The text search models outperform earlier methods when tested on the BEIR search evaluation suite.

Code search models

The new code search models offer code and text embeddings for code search tasks: given a natural language query, the goal is to retrieve the relevant code block from a set of candidates. Evaluated on the CodeSearchNet suite, these embeddings significantly outperform previous methods.

These embeddings are already being used in real-world applications. For example, using OpenAI’s embeddings, JetBrains Research’s Astroparticle Physics Lab examines data from The Astronomer’s Telegram and NASA’s GCN Circulars, letting researchers search these databases and publications for events like “crab pulsar bursts.” The embeddings also achieve 99.85% accuracy on data source classification.

Reference: https://openai.com/blog/introducing-text-and-code-embeddings/

Paper: https://arxiv.org/abs/2201.10005

Documentation: https://beta.openai.com/docs/guides/embeddings