Prerequisites
- TopK account (Sign up here)
- TopK API key (Get an API key here)
Semantic search is enabled by adding a semantic_index() to your collection schema and querying with fn.semantic_similarity():
semantic_index() is powered by Iso-ModernColBERT, TopK’s own multi-vector embedding model, combined with Sparse Multi-Vector Encoding (SMVE) for scalable retrieval and quantized MaxSim reranking.
Why multi-vector? Single-vector (dense) embeddings compress an entire document into one point in high-dimensional space. Multi-vector models like Iso-ModernColBERT keep one embedding per token, enabling token-level matching via MaxSim scoring. This consistently outperforms dense models on out-of-domain content, long documents, specific clauses, tables, and structured data. Read High-Quality Search, Out of the Box for benchmarks and a deep-dive into the architecture.
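To make the difference concrete, here is a minimal pure-Python sketch of MaxSim scoring next to a single-vector score. It is an illustration only: the toy 3-dimensional token vectors are made up, mean pooling is used as a simplified stand-in for a dense encoder, and none of the function names below come from the TopK SDK.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def maxsim(query_tokens, doc_tokens):
    # For each query token, keep its best match among the document tokens,
    # then sum those per-token maxima.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def mean_pool(tokens):
    # Simplified stand-in for a dense encoder: average the tokens into one vector.
    dim = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return normalize(pooled)

def dense_score(query_tokens, doc_tokens):
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))

# Toy data (hypothetical): two query tokens pointing along different axes.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# doc_a contains an exact match for each query token, plus unrelated content.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# doc_b contains only "blended" tokens that match neither query token exactly.
doc_b = [normalize([1.0, 1.0, 0.0])] * 3

print("MaxSim:", maxsim(query, doc_a), maxsim(query, doc_b))
print("Dense: ", dense_score(query, doc_a), dense_score(query, doc_b))
```

In this toy setup MaxSim ranks doc_a first because every query token finds an exact counterpart, while the pooled score ranks doc_b first because the unrelated token in doc_a drags its average away from the query.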
How to perform a semantic search
In the following example, we’ll define a collection schema, add documents to the collection, and run a semantic query.
Define a collection schema
Semantic search is enabled by adding a semantic_index() to a text() field in the collection schema:
Add documents to the collection
Let’s add some documents to the collection:
Run a semantic query
To search for documents based on semantic similarity, use the fn.semantic_similarity() function:
- The semantic_similarity() function encodes the query "classic American novel" into multi-vector token embeddings using Iso-ModernColBERT and scores each document via quantized MaxSim, comparing every query token against every document token to find the best alignment.
- Candidate retrieval is accelerated by SMVE, which uses fast sparse approximations to identify a small set of candidates before the full MaxSim pass.
- The results are ranked by their MaxSim score and the top 10 most relevant documents are returned.
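The retrieve-then-rerank flow described above can be sketched in plain Python. This is not SMVE or TopK's implementation: the candidate stage below uses a deliberately crude, hypothetical sparse code (the index of each token's largest component) purely to show the shape of the pipeline, and the token vectors and document IDs are made up.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    # Sum over query tokens of the best similarity against any document token.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def sparse_code(token):
    # Hypothetical coarse approximation: index of the largest component.
    return max(range(len(token)), key=lambda i: token[i])

def search(query_tokens, docs, n_candidates=2, top_k=10):
    query_codes = {sparse_code(q) for q in query_tokens}

    def overlap(doc_id):
        # Stage 1 score: how many query codes also occur in the document.
        doc_codes = {sparse_code(t) for t in docs[doc_id]}
        return len(query_codes & doc_codes)

    # Stage 1: cheap candidate selection over all documents.
    candidates = heapq.nlargest(n_candidates, docs, key=overlap)

    # Stage 2: full MaxSim only on the surviving candidates.
    scored = sorted(
        ((maxsim(query_tokens, docs[d]), d) for d in candidates),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Toy token embeddings (hypothetical).
docs = {
    "gatsby": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    "cookbook": [[0.0, 0.1, 0.9], [0.0, 0.2, 0.8]],
    "moby": [[0.8, 0.2, 0.0], [0.0, 0.3, 0.7]],
}
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

results = search(query, docs, n_candidates=2, top_k=10)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Here "cookbook" is dropped by the cheap first stage and never reaches the MaxSim pass, which is the point of the design: the expensive token-by-token comparison runs only on a small candidate set.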
Combining semantic and keyword search
For certain use cases, you might want to combine keyword search and semantic search, ensuring your search results capture both exact matches and contextual meaning with a custom scoring function that’s best suited for your use case.
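As a rough illustration of what such a custom scoring function might look like, the sketch below blends a keyword score with a semantic score using a weighted sum. Everything here is an assumption for demonstration: the term-overlap keyword score is a simple stand-in for a real keyword ranker such as BM25, the semantic scores are made-up numbers rather than output from TopK, and the 0.7 weight is arbitrary.

```python
def keyword_score(query, text):
    # Fraction of query terms that appear verbatim in the text
    # (a simple stand-in for a real keyword ranker).
    query_terms = query.lower().split()
    text_terms = set(text.lower().split())
    return sum(term in text_terms for term in query_terms) / len(query_terms)

def hybrid_score(semantic, keyword, semantic_weight=0.7):
    # Weighted sum: semantic_weight=1.0 is pure semantic, 0.0 is pure keyword.
    return semantic_weight * semantic + (1 - semantic_weight) * keyword

def rank(query, docs, semantic_weight=0.7):
    scored = {
        doc_id: hybrid_score(
            doc["semantic"], keyword_score(query, doc["text"]), semantic_weight
        )
        for doc_id, doc in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical documents with made-up, pre-computed semantic scores.
docs = {
    "gatsby": {"text": "The Great Gatsby", "semantic": 0.82},
    "catcher": {"text": "The Catcher in the Rye", "semantic": 0.80},
    "novel-guide": {"text": "Classic American novel study guide", "semantic": 0.78},
}

query = "classic American novel"
print(rank(query, docs))                       # blended ranking
print(rank(query, docs, semantic_weight=1.0))  # semantic-only ranking
```

With the blended weight, the document containing the exact query terms moves to the top even though its semantic score is slightly lower. Tuning the weight is how you trade exact matches against contextual meaning.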