
This is the second of two articles about learned sparse embeddings. Be sure to check out the previous installment on BGE-M3, which includes some critical background for understanding how the SPLADE model works.
TL;DR
Vector databases rely on various embeddings to retrieve data and generate accurate output for users. Learned sparse embeddings combine the ability of sparse embeddings to match keywords with the ability of dense embeddings to power semantic searches.
Bidirectional Encoder Representations from Transformers (or BERT) is the underlying architecture that powers the SPLADE model. We covered how BERT creates embeddings from a query text string in the last installment.
What Is SPLADE?
SPLADE (SParse Lexical AnD Expansion) is a model designed for information retrieval tasks, combining the strengths of sparse lexical representations with the semantic richness of dense, contextual embeddings.
Before we get to SPLADE, we need to return to BERT. There are two pre-training tasks that underpin BERT, one of which is Masked Language Modeling (MLM). This process randomly masks tokens in the input sequence and trains the model to predict which token best fits each masked position.
We used the following query to explain both BERT and BGE-M3, and we’ll use it again here for consistency.
Milvus is a vector database built for scalable similarity search.
In the tokenized sequence shown below, you can see that MLM masks two of the tokens.
This technique gives the model deeper linguistic comprehension and structural awareness of language because it must rely on the adjacent tokens to replace the masked values with accurate predictions.
For every masked slot during pre-training, the model takes the contextualized embedding from BERT (which we called Q in the previous article), here written Q_i, and outputs a probability distribution w_i, with w_{ij} denoting the likelihood that the j-th token in BERT's vocabulary occupies the masked position. The length of this output vector w_i matches the size of BERT's vocabulary, typically 30,522 tokens, and serves as a key learning signal for refining the model's predictions.
(Note: Probabilities are made up for demonstration purposes.)
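For readers who want to see real numbers rather than illustrative ones, here is a minimal sketch of querying BERT's MLM head directly. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; which token we mask is our own choice for illustration.

```python
# A minimal sketch: inspect BERT's MLM probabilities for one masked position.
# Assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Mask one token in the example query ("vector" is hidden here for illustration).
text = f"Milvus is a {tokenizer.mask_token} database built for scalable similarity search."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution w_i over the ~30,522-token vocabulary at the masked slot.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)

# Show the five most likely fillers for the masked position.
for p, idx in zip(*probs.topk(5)):
    print(f"{tokenizer.convert_ids_to_tokens(idx.item()):<12} {p.item():.3f}")
```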
While MLM is only a pre-training task for BERT itself, SPLADE takes that mechanism to the next level. The key difference is that once BERT generates tokens and embeddings, SPLADE applies the MLM head across all token positions, calculating for each position the probability of every word in BERT's vocabulary. It then aggregates these per-position distributions, using a log-saturation transform and max pooling over the token positions, into a single weighted relevance score per vocabulary term, producing a learned sparse vector.
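The following sketch shows what that aggregation can look like in practice. It assumes the publicly released naver/splade-cocondenser-ensembledistil checkpoint and the common log-saturation-plus-max-pooling formulation; treat the model name and details as illustrative rather than the only way to run SPLADE.

```python
# A sketch of SPLADE's aggregation step, assuming the naver/splade-cocondenser-ensembledistil
# checkpoint; the model choice is illustrative, not prescriptive.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "naver/splade-cocondenser-ensembledistil"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Milvus is a vector database built for scalable similarity search."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                     # (1, seq_len, vocab_size)

# Log-saturation keeps weights positive and dampens very large logits;
# max pooling over token positions collapses the per-position distributions
# into one weight per vocabulary entry -- the learned sparse vector.
mask = inputs["attention_mask"].unsqueeze(-1)           # (1, seq_len, 1)
sparse_vec = torch.max(torch.log1p(torch.relu(logits)) * mask, dim=1).values.squeeze(0)

print(f"non-zero dimensions: {(sparse_vec > 0).sum().item()} of {sparse_vec.shape[0]}")
```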
One of the key advantages of SPLADE is that it identifies relevant terms that were not present in the original text, a behavior known as term expansion. Because the resulting vector includes these additional tokens, term matching extends beyond the literal wording of the query, so retrieved results can contain relevant data that the original query string never mentioned.
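Continuing from the sketch above, printing the highest-weighted dimensions of sparse_vec makes this expansion visible: related terms that never appear in the input sentence typically show up with non-zero weights.

```python
# Continuing from the previous sketch: list the highest-weighted vocabulary terms.
# Synonyms and related words often appear here even though they are absent from
# the input sentence -- SPLADE's term expansion at work.
weights, indices = sparse_vec.topk(10)
for w, idx in zip(weights, indices):
    print(f"{tokenizer.convert_ids_to_tokens(idx.item()):<15} {w.item():.2f}")
```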
SPLADE in the Real World
SPLADE builds on BERT's contextual embeddings to produce sparse vectors that are both interpretable and semantically rich, making them especially useful for search and retrieval tasks where scope and term relevance matter. The following are a few real-world applications for the SPLADE model.
Search Engine Optimization and Enhancement
Improving Search Engine Relevance and Efficiency
SPLADE-generated learned sparse embeddings help search engines better understand user queries and retrieve documents that are both lexically and semantically relevant.
Benefits:
- Improved relevance: Provides more accurate and contextually relevant search results.
- Enhanced understanding: Better understands user intent, even for complex or vague queries.
- Scalability: Efficiently handles large-scale datasets due to its sparse representations.
- User satisfaction: Increases user satisfaction through more precise search outcomes.
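As a concrete illustration of that scalability, learned sparse vectors can be indexed and searched directly in a vector database. The following sketch assumes pymilvus 2.4+ (which supports sparse float vectors) and a local Milvus instance; the URI, collection name, and field names are placeholders.

```python
# A minimal sketch of indexing and searching SPLADE vectors in Milvus,
# assuming pymilvus 2.4+ with sparse-vector support; names below are placeholders.
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.SPARSE_FLOAT_VECTOR)

index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding",
                       index_type="SPARSE_INVERTED_INDEX",
                       metric_type="IP")                     # inner product for sparse vectors

client.create_collection("splade_docs", schema=schema, index_params=index_params)

# Sparse vectors are passed as {dimension_index: weight} dictionaries,
# e.g. the non-zero entries of a SPLADE vector.
client.insert("splade_docs", [{"embedding": {2054: 0.91, 10531: 1.37, 24340: 0.44}}])

results = client.search("splade_docs",
                        data=[{2054: 0.88, 24340: 0.52}],    # the query's sparse vector
                        anns_field="embedding",
                        limit=3)
print(results)
```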
E-Commerce Product Search and Recommendation
Enhanced Product Search and Personalized Recommendations
In e-commerce, SPLADE can improve the search functionality on online retail platforms by offering more accurate product search results. It can also enhance recommendation systems by understanding the nuanced preferences of users through their search and purchase history. This leads to better product discovery and personalized shopping experiences.
Benefits:
- Better product matching: Accurately matches search queries with relevant products.
- Personalization: Provides personalized recommendations based on user behavior and preferences.
- Conversion rates: Increases conversion rates by helping customers find what they are looking for more efficiently.
- Inventory management: Helps in better inventory management by understanding product demand.
Academic and Scientific Research
Enhanced Literature Search and Knowledge Discovery
You can use SPLADE to improve literature searches in academic and scientific research. Researchers often need to find relevant papers, articles and data within extensive academic databases. SPLADE’s ability to capture both lexical and semantic content can provide researchers with more precise and comprehensive search results, facilitating better knowledge discovery.
Benefits:
- Comprehensive search: Retrieves a broader range of relevant documents by understanding complex scientific queries.
- Time efficiency: Saves researchers’ time by providing more accurate results quickly.
- Interdisciplinary research: Aids in discovering connections between different fields of study.
- Research quality: Enhances the quality of research by ensuring that critical and relevant literature is not overlooked.
Conclusion
The SPLADE model’s ability to create learned sparse embeddings by combining sparse lexical representations with dense embeddings makes it exceptionally powerful for information retrieval tasks.