Spotify Employs Natural Language Search/Semantic Search For Podcast Episodes

This article is based on Spotify’s research article 'Introducing Natural Language Search for Podcast Episodes'.

Users don’t always type the exact words that describe what they are looking for. Search algorithms compensate with fuzzy matching, normalization, and even manual aliases. These strategies help users a great deal, but they cannot capture every way of expressing an idea in natural language, particularly when the query is a full natural language sentence.

Until recently, Spotify’s search was primarily based on term matching. For example, if a user searches for “electric vehicles climate impact,” Elasticsearch returns every item whose indexed metadata contains each of the query words. Such results do not guarantee that the material most relevant to the query is returned: a relevant episode that uses different wording is missed.
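To make the limitation concrete, here is a minimal Python sketch of all-terms matching. It is not how Elasticsearch is implemented. The helper names (`tokenize`, `matches_all_terms`) and the two sample episode titles are hypothetical, invented only to illustrate the behavior described above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all_terms(query, document):
    """Return True only if every query token also appears in the document."""
    return tokenize(query).issubset(tokenize(document))

# Hypothetical episode titles, invented for this example.
episodes = [
    "Electric vehicles and their climate impact explained",
    "Are EVs really better for the environment?",
]

query = "electric vehicles climate impact"
hits = [title for title in episodes if matches_all_terms(query, title)]
print(hits)
```

Only the first title is returned. The second one is arguably about the same topic, but it shares no words with the query, so term matching cannot find it.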

The researchers at Spotify started exploring a technique called Natural Language Search (often known as Semantic Search in the literature) to help users find more relevant content with less effort. Rather than requiring exact word matches, this method matches a query to textual documents that are semantically related to it. They expected semantic matching to be most effective for podcast search, so they applied it to podcast episode retrieval as a first step. The solution is now in use for the majority of Spotify users.


Using self-supervised learning and Transformer neural networks, they obtained results in which none of the retrieved episodes contained all of the query words in their titles, yet the episodes were highly relevant to the user’s query. For fast online serving, they used vector search techniques such as Approximate Nearest Neighbor (ANN) search.

Dense Retrieval is a machine learning technique in which a model is trained to produce query vectors and episode vectors in a shared embedding space. The objective is for the vector of a search query and the vector of a relevant episode to lie close together in that space. On the query side, the model takes the query text as input; on the episode side, it takes the episode’s textual metadata fields (such as its title, its description, and the title and description of its parent podcast show).
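The idea of "close together in the embedding space" can be sketched in a few lines of Python. This is an illustration under assumptions, not Spotify's code: the field names in `build_episode_text` are hypothetical, joining the fields with spaces is an assumed way of combining them, and the 3-dimensional vectors are toy values standing in for real encoder outputs.

```python
import math

def build_episode_text(episode):
    """Join the textual metadata fields into one input string (assumed scheme)."""
    fields = ("title", "description", "show_title", "show_description")
    return " ".join(episode[f] for f in fields if episode.get(f))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical episode record, invented for this example.
episode = {
    "title": "Do EVs help the planet?",
    "description": "A look at emissions over a car's lifetime.",
    "show_title": "Green Tech Weekly",
    "show_description": "News about sustainable technology.",
}
episode_text = build_episode_text(episode)

# Toy vectors standing in for what the two encoders would produce.
query_vec = [0.9, 0.1, 0.0]
relevant_episode_vec = [0.8, 0.2, 0.1]
unrelated_episode_vec = [0.0, 0.1, 0.9]

sim_relevant = cosine_similarity(query_vec, relevant_episode_vec)
sim_unrelated = cosine_similarity(query_vec, unrelated_episode_vec)
print(sim_relevant > sim_unrelated)
```

A well-trained model should give a relevant (query, episode) pair a higher similarity than an unrelated pair, as the toy vectors do here.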

They then use vector search techniques to efficiently retrieve, during live search traffic, the episodes whose vectors are closest to the query vector.
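As a point of reference, the exact (brute-force) version of this retrieval step can be sketched as follows. It scores every episode against the query, which is what an approximate index avoids at scale. The function name `top_k_episodes`, the episode ids, and the toy vectors are assumptions made for illustration.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_episodes(query_vec, episode_vecs, k):
    """Return the k (episode_id, score) pairs with the highest similarity."""
    scored = (
        (episode_id, cosine_similarity(query_vec, vec))
        for episode_id, vec in episode_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy pre-computed episode vectors (hypothetical ids and values).
episode_vecs = {
    "ep_a": [0.8, 0.2, 0.1],
    "ep_b": [0.0, 0.1, 0.9],
    "ep_c": [0.7, 0.6, 0.0],
}

results = top_k_episodes([0.9, 0.1, 0.0], episode_vecs, k=2)
print([episode_id for episode_id, _ in results])
```

The cost of this exact search grows linearly with the number of episodes, which is why approximate methods are attractive for a catalog with tens of millions of entries.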

At present, Transformer models like BERT are used for most NLP problems. However, BERT itself is pretrained on English text only. Moreover, its pretraining focuses on producing high-quality contextual word embeddings, while its off-the-shelf sentence representations are weak.

The self-supervised objective Conditional Masked Language Modeling (CMLM) was recently introduced to learn high-quality sentence embeddings directly. To support multilingual queries and episodes, the team chose the Universal Sentence Encoder CMLM model as their base model. It has been pre-trained on a massive multilingual corpus covering over 100 languages.

The team fine-tuned the pre-trained Transformer model on their target task of performing Natural Language Search on Spotify’s podcast episodes. To that end, they preprocess a variety of data types:

  • They take successful podcast searches from their search logs and turn them into (query, episode) pairs. Previous Elasticsearch queries and their returned results account for the majority of those successes. They also look at user sessions in previous search logs to find search queries that succeeded after an initial search had failed.
  • They construct synthetic queries from popular episode titles and descriptions to expand and diversify the training set. To generate these (synthetic query, episode) pairs, they fine-tuned a BART transformer model on the MS MARCO dataset.
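The session-based part of this preprocessing could look roughly like the sketch below. It rests on several assumptions that the source does not spell out: the log schema (`query`, `clicked_episode`) is hypothetical, a search counts as successful when an episode was clicked, and the failed attempts that precede a success in the same session are assumed to share its intent.

```python
def pairs_from_session(session):
    """Build (query, episode_id) training pairs from one search session.

    `session` is a list of events in time order. Each event is a dict with a
    "query" string and a "clicked_episode" id, which is None when the search
    did not lead to a click (treated here as a failed search).
    """
    pairs = []
    failed_queries = []
    for event in session:
        if event["clicked_episode"] is None:
            failed_queries.append(event["query"])
            continue
        episode_id = event["clicked_episode"]
        # The successful search itself yields a positive pair.
        pairs.append((event["query"], episode_id))
        # Earlier failed attempts are paired with the same episode (assumption).
        pairs.extend((query, episode_id) for query in failed_queries)
        failed_queries = []
    return pairs

# Hypothetical session, invented for this example.
session = [
    {"query": "why are evs good for the planet", "clicked_episode": None},
    {"query": "electric vehicles climate impact", "clicked_episode": "ep_42"},
]
training_pairs = pairs_from_session(session)
print(training_pairs)
```

Pairs built from the failed attempts are useful because those queries are exactly the natural language phrasings that term matching could not serve.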

Training the model adequately requires both positive and negative (query, episode) pairs. The team used a technique called in-batch negatives to efficiently generate random negative pairs during training: within a batch of positive pairs, each query is also paired with the episodes that belong to the other queries in the batch.
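A minimal sketch of in-batch negatives follows. The source does not state which loss function was used, so the softmax cross-entropy over dot-product scores shown here is an assumption (it is a common choice for this setup), and the 2-dimensional vectors are toy values.

```python
import math

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_negatives_loss(query_vecs, episode_vecs):
    """Mean softmax cross-entropy where episode i is the positive for query i.

    Every other episode in the batch acts as a negative for query i, so no
    separate negative sampling step is needed.
    """
    losses = []
    for i, query_vec in enumerate(query_vecs):
        scores = [dot(query_vec, episode_vec) for episode_vec in episode_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_score = max(scores)
        log_denominator = max_score + math.log(
            sum(math.exp(score - max_score) for score in scores)
        )
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
misaligned = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

The loss is lower when each query vector points toward its own episode than when the episodes are swapped, which is the signal that pulls positive pairs together during training.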

They use the following two kinds of metrics to assess the models:

  1. In-batch metrics: Using in-batch negatives, metrics such as Recall@1 and Mean Reciprocal Rank (MRR) can be computed quickly at the batch level.
  2. Full-retrieval setting metrics: During training, they compute the vectors of all episodes in their evaluation set and then compute metrics using queries from the same eval set. This evaluates the model in a more realistic setting, with many more candidates than a single batch provides.
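Both metrics can be computed from a matrix of similarity scores, whether the candidates are the episodes in a batch or the whole evaluation set. The sketch below assumes that the relevant episode for query i sits in column i and that ties are resolved in favor of the relevant episode; the function name and the scores are made up for illustration.

```python
def recall_at_1_and_mrr(score_matrix):
    """Compute Recall@1 and MRR from a square score matrix.

    score_matrix[i][j] is the similarity of query i to episode j, and the
    relevant episode for query i is episode i.
    """
    hits = 0
    reciprocal_ranks = []
    for i, row in enumerate(score_matrix):
        # Rank is 1 plus the number of episodes scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        if rank == 1:
            hits += 1
        reciprocal_ranks.append(1.0 / rank)
    n = len(score_matrix)
    return hits / n, sum(reciprocal_ranks) / n

scores = [
    [0.9, 0.1, 0.3],  # relevant episode ranked 1st
    [0.8, 0.5, 0.2],  # relevant episode ranked 2nd
    [0.7, 0.6, 0.4],  # relevant episode ranked 3rd
]
recall, mrr = recall_at_1_and_mrr(scores)
print(recall, mrr)
```

Here one query out of three has its relevant episode at the top, so Recall@1 is 1/3, and MRR is the mean of 1, 1/2, and 1/3, which is 11/18.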

In an offline pipeline, episode vectors are pre-computed for a large number of episodes using the episode encoder. These vectors are then indexed in the Vespa search engine, which supports ANN search. With ANN, retrieval latency over tens of millions of indexed episodes stays acceptable, and retrieval metrics are only minimally affected.

When a user types a query, the query vector is computed by the query encoder, which is deployed on Google Cloud Vertex AI. Support for GPU inference was one of the key reasons the researchers chose Vertex AI. The query vector is then used to retrieve the top 30 “semantic podcast episodes” from Vespa. A vector cache is also used so that the vector for a repeated query does not have to be computed again.
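The caching idea can be sketched with Python's standard `functools.lru_cache`. The source does not describe how the cache is built, so everything here is an assumption: `encode_query` is a hypothetical stand-in for the remote encoder call, the cache size is arbitrary, and normalizing case and whitespace before the lookup is an illustrative choice.

```python
from functools import lru_cache

encoder_calls = 0  # counts how often the (expensive) encoder is invoked

def encode_query(query):
    """Hypothetical stand-in for a call to the deployed query encoder."""
    global encoder_calls
    encoder_calls += 1
    # Toy deterministic "embedding" derived from the first characters.
    return tuple(float(ord(char) % 7) for char in query[:4])

@lru_cache(maxsize=10_000)
def cached_query_vector(normalized_query):
    """Return the cached vector, calling the encoder only on a cache miss."""
    return encode_query(normalized_query)

def query_vector(query):
    """Normalize the query so trivial variants share one cache entry."""
    normalized = " ".join(query.lower().split())
    return cached_query_vector(normalized)

first = query_vector("Electric vehicles")
second = query_vector("  electric   VEHICLES ")
print(encoder_calls)
```

The second lookup is served from the cache, so the encoder runs only once for the two variants of the same query.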

Although Dense Retrieval / Natural Language Search has clear strengths, it often falls short of classic IR methods when exact term matching matters. It is also more expensive to run on every query. That’s why, rather than simply replacing their existing retrieval sources, the researchers opted to add Natural Language Search as an extra source.

The final-stage reranking model in Spotify Search takes the top candidates from each retrieval source and produces the final ranking shown to the user. The team added the (query, episode) cosine similarity value to this model’s input features to help it rank semantic candidates better.
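How candidates from two retrieval sources might be merged before reranking is sketched below. The feature names, the data shapes, and the use of `None` for candidates that have no cosine score are all assumptions made for illustration; the source only states that the cosine similarity was added as an input feature.

```python
def merge_candidates(term_hits, semantic_hits):
    """Union the candidates from both sources and attach per-candidate features.

    term_hits: list of episode ids returned by term-based retrieval.
    semantic_hits: dict mapping episode id to its (query, episode) cosine
    similarity from vector search.
    """
    def empty_features():
        return {"from_term_match": False, "from_semantic": False,
                "cosine_similarity": None}

    merged = {}
    for episode_id in term_hits:
        merged.setdefault(episode_id, empty_features())["from_term_match"] = True
    for episode_id, similarity in semantic_hits.items():
        features = merged.setdefault(episode_id, empty_features())
        features["from_semantic"] = True
        features["cosine_similarity"] = similarity
    return merged

# Hypothetical candidates, invented for this example.
candidates = merge_candidates(
    term_hits=["ep_a", "ep_b"],
    semantic_hits={"ep_b": 0.83, "ep_c": 0.91},
)
print(sorted(candidates))
```

An episode found by both sources appears once, carrying the features from each, and the reranking model can then weigh the cosine similarity alongside its other inputs.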


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.