Like dense vector search, sparse vector search is a core feature of TopK. It is designed to:
- Provide 100% recall.
- Support learned sparse vector representations, for example SPLADE.
- Provide consistent low latency (<20ms p99). Check out our benchmarks.
- Support large-scale single-collection as well as multi-tenant use cases.
How to run a sparse vector search
Prerequisites
Define a schema with a sparse vector field, either f32_sparse_vector() or u8_sparse_vector(), and add a vector_index() to it:
from topk_sdk.schema import text, f32_sparse_vector, vector_index

# Assumes `client` is an already-initialized TopK client.
client.collections().create(
    "books",
    schema={
        "title": text().required(),
        "title_embedding": f32_sparse_vector()
            .required()
            .index(vector_index(metric="dot_product")),
    },
)
Sparse vectors do not have a fixed dimension, so you don’t need to specify the vector dimension when defining the field.
TopK supports only the dot_product metric for sparse vectors, which is compatible with both fixed and learned sparse vector representations.
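To make the dot product over vectors without a fixed dimension concrete, here is a minimal pure-Python sketch. It assumes a sparse vector is held as an `{index: value}` mapping, as in the `f32_sparse_vector({...})` literals on this page. The `sparse_dot` helper is hypothetical and is not part of `topk_sdk`; it illustrates the metric only and says nothing about how TopK computes it internally.

```python
from typing import Dict


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: value} mappings."""
    # Iterate over the smaller vector; an index missing from the other
    # vector contributes zero, so only overlapping indices matter.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[index] for index, value in a.items() if index in b)


# Illustrative values only. The two vectors have different, arbitrary indices.
query = {0: 0.5, 6: 2.0}
document = {6: 0.25, 1_000_000: 3.0}

score = sparse_dot(query, document)
print(score)  # only index 6 overlaps: 2.0 * 0.25 = 0.5
```

Because only the overlapping indices contribute, neither vector needs a declared dimension.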
Find the most relevant documents
To find the top-k most relevant results for a query, use the vector_distance() function.
This function computes a score between the provided sparse query vector and each indexed sparse vector,
which can then be used to sort the results.
from topk_sdk.query import select, field, fn
from topk_sdk.data import f32_sparse_vector

docs = client.collection("books").query(
    select(
        "title",
        published_year=field("published_year"),
        # Compute relevance score between the sparse vector embedding of the string "epic fantasy adventure"
        # and the embedding stored in the `title_embedding` field.
        title_score=fn.vector_distance(
            "title_embedding",
            f32_sparse_vector({0: 0.12, 6: 0.67, ...}),
        ),
    )
    # Return top 10 results
    .topk(field("title_score"), 10)
)
# Example results:
[
    {
        "_id": "2",
        "title": "Lord of the Rings",
        "title_score": 0.8150404095649719
    },
    {
        "_id": "1",
        "title": "The Catcher in the Rye",
        "title_score": 0.7825378179550171
    }
]
Let’s break down the example above:
- Compute the sparse dot product between the query embedding and the title_embedding field using the vector_distance() function.
- Store the computed dot product score in the title_score field.
- Return the top 10 results sorted by the title_score field in descending order.
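The three steps above can be sketched in plain Python as a brute-force ranking. This is a conceptual illustration only: the `sparse_dot` and `top_k` helpers and the sample documents are hypothetical, and the sketch does not reflect how TopK's index executes the query.

```python
import heapq
from typing import Dict, List


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over the indices that both sparse vectors share."""
    return sum(value * b[index] for index, value in a.items() if index in b)


def top_k(query: Dict[int, float], docs: List[dict], k: int) -> List[dict]:
    # Steps 1 and 2: compute the dot product and store it as `title_score`.
    scored = [
        {
            "_id": doc["_id"],
            "title": doc["title"],
            "title_score": sparse_dot(query, doc["title_embedding"]),
        }
        for doc in docs
    ]
    # Step 3: keep the k highest scores, in descending order.
    return heapq.nlargest(k, scored, key=lambda row: row["title_score"])


# Made-up documents and weights, for illustration only.
docs = [
    {"_id": "1", "title": "Book A", "title_embedding": {0: 1.0, 3: 2.0}},
    {"_id": "2", "title": "Book B", "title_embedding": {6: 4.0}},
    {"_id": "3", "title": "Book C", "title_embedding": {7: 1.0}},
]
query = {0: 0.5, 6: 0.5}

results = top_k(query, docs, k=2)
for row in results:
    print(row["_id"], row["title"], row["title_score"])
```

With these sample values, "Book B" scores 2.0 and "Book A" scores 0.5, so they are returned in that order. "Book C" shares no indices with the query and is dropped.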
Sparse vector search can be easily combined with metadata filtering by adding a filter() stage to the query:
from topk_sdk.query import select, field, fn
from topk_sdk.data import f32_sparse_vector

docs = client.collection("books").query(
    select(
        "title",
        title_score=fn.vector_distance(
            "title_embedding",
            f32_sparse_vector({0: 0.12, 6: 0.67, ...}),
        ),
        published_year=field("published_year"),
    )
    # Keep only books published after 2000
    .filter(field("published_year") > 2000)
    # Return top 10 results among the remaining books
    .topk(field("title_score"), 10)
)
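As a rough mental model of the filtered query, the following plain-Python sketch applies a metadata predicate and then ranks the remaining documents by sparse dot product. The `filtered_top_k` helper and the sample books are hypothetical, and the sketch makes no claim about the order in which TopK evaluates the stages internally. It only shows that every returned result satisfies both the filter and the ranking.

```python
from typing import Dict, List


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over the indices that both sparse vectors share."""
    return sum(value * b[index] for index, value in a.items() if index in b)


def filtered_top_k(
    query: Dict[int, float], books: List[dict], min_year: int, k: int
) -> List[dict]:
    # Filter: keep only books published after `min_year`.
    candidates = [b for b in books if b["published_year"] > min_year]
    # Select: attach the relevance score to each remaining book.
    scored = [
        {
            "title": b["title"],
            "published_year": b["published_year"],
            "title_score": sparse_dot(query, b["title_embedding"]),
        }
        for b in candidates
    ]
    # Top-k: sort by score in descending order and keep at most k rows.
    return sorted(scored, key=lambda r: r["title_score"], reverse=True)[:k]


# Made-up books and weights, for illustration only.
books = [
    {"title": "Old Epic", "published_year": 1954, "title_embedding": {0: 4.0, 6: 4.0}},
    {"title": "New Epic", "published_year": 2005, "title_embedding": {0: 1.0, 6: 2.0}},
    {"title": "New Drama", "published_year": 2010, "title_embedding": {6: 0.5}},
]
query = {0: 0.5, 6: 0.5}

results = filtered_top_k(query, books, min_year=2000, k=10)
for row in results:
    print(row["title"], row["published_year"], row["title_score"])
```

In this sample, "Old Epic" has the highest raw score but is excluded by the filter, so fewer than `k` rows come back.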