Query structure
In TopK, a query consists of multiple stages:
- Select stage - Select static or computed fields that will be returned in the query results
  - these fields can be used in stages such as Filter, TopK or Rerank
- Filter stage - Filter the documents that will be returned in the query results
  - filters can be applied to static fields, computed fields such as vector_distance() or semantic_similarity(), or custom properties computed inside select()
- TopK stage - Return the top k results based on the provided logical expression
- Count stage - Return the total number of documents matching the query
- Rerank stage - Rerank the results
You can stack multiple select and filter stages in a single query.
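The stage ordering above can be sketched in plain Python. This is a toy model of the semantics only, not the TopK SDK: the run_query function, the ("select", ...) / ("filter", ...) tuples and the sample documents are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable

Row = dict[str, Any]

def run_query(docs: list[Row], stages: list[tuple[str, Any]],
              sort_by: str, k: int, asc: bool = False) -> list[Row]:
    """Apply stacked select/filter stages in order, then collect the top k rows."""
    rows = [dict(doc) for doc in docs]
    for kind, arg in stages:
        if kind == "select":
            # arg maps output names to expressions; later stages can use them.
            # Simplification: this model keeps every field instead of projecting.
            for row in rows:
                row.update({name: expr(row) for name, expr in arg.items()})
        elif kind == "filter":
            # arg is a predicate over the current row
            rows = [row for row in rows if arg(row)]
        else:
            raise ValueError(f"unknown stage: {kind}")
    rows.sort(key=lambda r: r[sort_by], reverse=not asc)
    return rows[:k]

books = [
    {"_id": "a", "title": "The Catcher in the Rye", "published_year": 1951},
    {"_id": "b", "title": "Harry Potter", "published_year": 1997},
    {"_id": "c", "title": "The Great Gatsby", "published_year": 1925},
]
stages = [
    ("select", {"age": lambda r: 2000 - r["published_year"]}),  # computed field
    ("filter", lambda r: r["published_year"] > 1930),           # static field
    ("filter", lambda r: r["age"] < 60),                        # computed field
]
result = run_query(books, stages, sort_by="age", k=2, asc=True)
```

The second filter only works because the select stage ran first, which is the reason computed fields are defined before the stages that consume them.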
Select
The select() function is used to initialize the select stage of a query. It accepts key-value pairs of field names and field expressions.
Select expressions
Use the field() function to select fields from a document. In the select stage, you can also rename existing fields or define computed fields using function expressions.
Function expressions
Function expressions are used to define computed fields that will be included in your query results. TopK currently supports three main function expressions:
- vector_distance(field, vector): Computes distance between vectors for vector search. This function is available for all dense and sparse vector types.
- bm25_score(): Calculates relevance scores using the BM25 algorithm for keyword search
- semantic_similarity(field, query): Measures semantic similarity between the provided text query and the field’s embedding
Vector distance
The vector_distance() function is used to compute the vector score between a query vector and a vector field in a collection.
There are multiple ways to represent a query vector:
- Dense vectors:
  - [0.1, 0.2, 0.3, ...] - Array of numbers resolved as a dense float32 vector
  - f32_vector([...]) - Helper function returning a dense float32 vector
  - u8_vector([...]) - Helper function returning a dense u8 vector
  - binary_vector([...]) - Helper function returning a binary vector
- Sparse vectors:
  - { 0: 0.1, 1: 0.2, 2: 0.3, ... } - Mapping from index → value resolved as a sparse float32 vector
  - f32_sparse_vector({ ... }) - Helper function returning a sparse float32 vector
  - u8_sparse_vector({ ... }) - Helper function returning a sparse u8 vector
See the Helper functions page for details on how to use vector helper functions.
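To make the two representations concrete, here is a small plain-Python sketch that treats a dense vector as a list and a sparse vector as an index → value mapping, and computes a dot product over each. The function names are hypothetical, and the dot product is only one possible scoring function; which metric TopK applies depends on the vector index and is not modeled here.

```python
def dense_dot(a: list[float], b: list[float]) -> float:
    """Dot product of two dense vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors; only shared indices contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(value * b[index] for index, value in a.items() if index in b)

def to_dense(sparse: dict[int, float], dim: int) -> list[float]:
    """Expand a sparse mapping into a dense list of length dim."""
    out = [0.0] * dim
    for index, value in sparse.items():
        out[index] = value
    return out

query = {0: 0.5, 2: 2.0}
document = {2: 3.0, 5: 1.0}
sparse_result = sparse_dot(query, document)
dense_result = dense_dot(to_dense(query, 6), to_dense(document, 6))
```

Both paths give the same number; the sparse form just skips the indices that are zero on either side.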
Pass skip_refine=True to bypass the internal distance refinement step. This will improve performance for queries with a large top_k at the cost of lower accuracy.
We don’t recommend using skip_refine=True unless you’re using a large top_k and a custom reranking model to get the final ranking.
To use the vector_distance() function, you must have a vector index defined on the field you’re computing the vector distance against.
BM25 Score
The BM25 score is a relevance score that can be used to score documents based on their text content. To use fn.bm25_score() in your query, you must include a match predicate in your filter stage.
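For intuition about what a BM25 score measures, the following is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not TopK's implementation: the tokenizer, the parameter values k1=1.2 and b=0.75, and the function names are assumptions made for illustration. The query terms play the role of the match predicate, since the score is only defined relative to the terms being matched.

```python
import math
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split on any non-alphanumeric character (assumed pipeline)."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document in a non-empty corpus against the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for other in tokenized if term in other)  # document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the catcher in the rye", "the great gatsby", "catcher catcher catcher"]
scores = bm25_scores("catcher", docs)
```

Documents without any query term score zero, and a short document that repeats the term outranks a longer one that mentions it once.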
To use the fn.bm25_score() function, you must have a keyword index defined in your collection schema.
Semantic similarity
The semantic_similarity() function is used to compute the similarity between a text query and a text field in a collection.
To use the semantic_similarity() function, you must have a semantic index defined on the field you’re computing the similarity on.
Advanced select expressions
TopK doesn’t only let you select static fields from your documents or computed fields using function expressions. You can also use TopK’s powerful expression language to select fields by chaining arbitrary logical expressions.
Filtering
You can filter documents by metadata, keywords, custom properties computed inside select() (e.g. vector similarity or BM25 score) and more.
Filter expressions support all
Metadata filtering
Keyword search
The match() function is the backbone of keyword search in TopK.
It allows you to search for documents that contain specific keywords or phrases.
You can configure the match() function to:
- Match on multiple terms
- Match only on specific fields
- Use weights to prioritize certain terms
The match() function accepts the following parameters:
- String token to match. Can also contain multiple terms separated by a delimiter, which is any non-alphanumeric character.
- Field to match on. If not provided, the function will match on all fields.
- Weight to use for matching. If not provided, the function will use the default weight (1.0).
- Use the all parameter when a text must contain all terms (separated by a delimiter):
  - when all is false (default), it is equivalent to the OR operator
  - when all is true, it is equivalent to the AND operator
Searching for the keyword "catcher" in your documents is as simple as using the match() function in the filter stage of your query.
Match multiple terms
The match() function can be configured to match all terms when using a delimiter.
A term delimiter is any non-alphanumeric character.
To ensure that all terms are matched, use the all parameter.
Give weight to specific terms
You can give weight to specific terms by using the weight parameter.
Combine keyword search and metadata filtering
You can combine metadata filtering and keyword search in a single query by stacking multiple filter stages. For example, you can search for documents that contain the keyword "catcher" and were published in 1997, or between 1920 and 1980.
Operators
When writing queries, you can use the following operators for:
- field selection
- filtering
- topk collection
Logical operators
Logical operators combine multiple expressions by applying boolean logic and conditions.
and
The and operator can be used to combine multiple logical expressions.
or
The or operator can be used to combine multiple logical expressions.
not
The not helper can be used to negate a logical expression. It takes an expression as an argument and inverts its logic.
all
The all() helper evaluates to true if each expression in the array is true. It’s equivalent to applying the logical AND operator across all expressions.
any
The any() helper evaluates to true if at least one expression in the array is true. It’s equivalent to applying the logical OR operator across all expressions.
choose
The choose operator evaluates a condition and returns the first argument if the condition is true, else the second argument.
boost
The boost operator multiplies the scoring expression by the provided boost value if the condition is true.
Otherwise, the scoring expression is unchanged (multiplied by 1).
coalesce
The coalesce operator replaces null values with a provided value.
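The three conditional operators above reduce to small pure functions. The sketch below models their described behavior in plain Python; the function signatures and argument order are hypothetical and do not reflect how the operators are spelled in the SDK.

```python
def choose(condition: bool, if_true, if_false):
    """Return the first value when the condition holds, otherwise the second."""
    return if_true if condition else if_false

def boost(score: float, condition: bool, factor: float) -> float:
    """Multiply the score by factor when the condition holds, otherwise by 1."""
    return score * (factor if condition else 1)

def coalesce(value, default):
    """Replace a null (None) value with the default; other values pass through."""
    return default if value is None else value

doc = {"score": 2.0, "featured": True, "rating": None}
boosted = boost(doc["score"], doc["featured"], 1.5)
rating = coalesce(doc["rating"], 0)
label = choose(rating > 3, "recommended", "unrated")
```

Note that coalesce only replaces null: a stored 0 or empty string is a real value and is left untouched.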
Comparison operators
Comparison operators provide various logical, numerical and string functions that evaluate to true or false.
eq
The eq operator can be used to match documents that have a field with a specific value.
ne
The ne operator can be used to match documents that have a field with a value that is not equal to a specific value.
is_null
The is_null operator can be used to match documents that have a field with a value that is null.
is_not_null
The is_not_null operator can be used to match documents that have a field with a value that is not null.
gt
The gt operator can be used to match documents that have a field with a value greater than a specific value.
For strings, it uses lexicographic order.
gte
The gte operator can be used to match documents that have a field with a value greater than or equal to a specific value.
For strings, it uses lexicographic order.
lt
The lt operator can be used to match documents that have a field with a value less than a specific value.
For strings, it uses lexicographic order.
lte
The lte operator can be used to match documents that have a field with a value less than or equal to a specific value.
For strings, it uses lexicographic order.
starts_with
The starts_with operator can be used on string fields to match documents that start with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.
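The tenant-scoping pattern can be illustrated with Python's str.startswith. This is a plain-Python sketch rather than an SDK query; the scope_to_tenant helper and the "_id" key are assumptions for the example.

```python
def scope_to_tenant(docs: list[dict], tenant_id: str) -> list[dict]:
    """Keep only documents whose ID has the form {tenant_id}/{document_id}."""
    # Include the trailing slash so tenant "acme" does not also match "acme2/...".
    prefix = f"{tenant_id}/"
    return [doc for doc in docs if doc["_id"].startswith(prefix)]

docs = [
    {"_id": "acme/1"},
    {"_id": "acme2/1"},
    {"_id": "globex/7"},
]
acme_docs = scope_to_tenant(docs, "acme")
```

Building the prefix with the separator included is the important design choice: a bare tenant ID would leak documents from any tenant whose ID happens to share the same leading characters.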
contains
The contains operator can be used on both text fields and list fields to match documents that include a specific value.
- Text fields: Matches documents that include a specific substring. It is case-sensitive and avoids the text processing pipeline (tokenization and stemming) used by the match() function. This makes it particularly useful when you need exact substring matching or want to provide your own pre-processed tokens. Unlike match(), the contains operator can be used without requiring a keyword index.
- List fields: Matches documents where the list field contains the specified value. The value can be a literal or a field reference. You can also use a list of strings with a keyword index if you want to provide your own tokens instead of using the text processing pipeline.
The contains operator works exactly the same as the in operator, but with reversed operands: x CONTAINS y is equivalent to y IN x. Both operators are provided for convenience and to make queries more readable.
in
The in (or in_ in Python) operator checks if a field value is present in a list of values, a string literal or another field. It can be used in several ways:
- Field in list: Checks if a field value is present in a list of literal values.
- Field in string: Checks if a string field is a substring of another string. Unlike match(), this avoids the text processing pipeline (tokenization and stemming) and performs exact substring matching.
- Field in field: Checks if a field value is present in another field.
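The three uses listed above, and the relationship between in and contains, map closely onto Python's own membership test. The sketch below is a plain-Python model with hypothetical helper names and sample data; it does not call the SDK.

```python
def contains(container, value) -> bool:
    """x CONTAINS y: substring test for strings, membership test for lists."""
    return value in container

def in_(value, container) -> bool:
    """y IN x: the same check as contains, with the operands reversed."""
    return contains(container, value)

doc = {
    "title": "The Catcher in the Rye",
    "genre": "fiction",
    "tags": ["classic", "fiction"],
    "published_year": 1951,
}

field_in_list = in_(doc["published_year"], [1925, 1951])   # field in list of literals
field_in_string = in_(doc["genre"], "science fiction")     # field as substring of a string
field_in_field = in_(doc["genre"], doc["tags"])            # field in another (list) field

exact_case = contains(doc["title"], "Catcher")
wrong_case = contains(doc["title"], "catcher")             # case-sensitive, so no match
```

The last two lines show the difference from keyword matching: there is no tokenization or case folding, so only an exact substring matches.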
The in operator works exactly the same as the contains operator, but with reversed operands: y IN x is equivalent to x CONTAINS y. Both operators are provided for convenience and to make queries more readable.
match_all
The match_all operator returns true if all terms in the query are present in the field with a keyword index.
When using the match_all operator against a text field, it must be used in conjunction with a keyword index defined in your collection schema.
match_any
The match_any operator returns true if any term in the query is present in the field with a keyword index.
When using the match_any operator against a text field, it must be used in conjunction with a keyword index defined in your collection schema.
Mathematical operators
Mathematical operators perform computations on numbers.
add
The add operator can be used to add two numbers.
sub
The sub operator can be used to subtract two numbers.
mul
The mul operator can be used to multiply two numbers.
div
The div operator can be used to divide two numbers.
abs
The abs operator returns the absolute value of a number, which is useful for calculating distances or differences.
min
The min operator returns the smaller of two values, commonly used for clamping or setting upper bounds. It can work with both scalar values and other fields or expressions.
For strings, it uses lexicographic order.
max
The max operator returns the larger of two values, commonly used for clamping or setting lower bounds. It can work with both scalar values and other fields or expressions.
For strings, it uses lexicographic order.
ln
The ln operator calculates the natural logarithm, useful for logarithmic scaling and dampening large values.
exp
The exp operator calculates the exponential function (e^x), useful for exponential scaling and boosting.
sqrt
The sqrt operator calculates the square root, useful for dampening values and creating non-linear transformations.
square
The square operator multiplies a number by itself (x²), useful for amplifying differences and creating quadratic transformations.
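As an illustration of how these operators combine in a scoring expression, the plain-Python sketch below dampens a raw view count with a logarithm, caps it with min, and clamps a value into a range with min and max. The function names, the 1 + views offset and the cap value are hypothetical choices for the example, not part of the TopK API.

```python
import math

def clamp(x: float, lower: float, upper: float) -> float:
    """max sets the lower bound, min sets the upper bound."""
    return max(lower, min(x, upper))

def popularity_weight(views: int, cap: float = 10.0) -> float:
    """Dampen large counts with ln, then cap the result with min."""
    # 1 + views keeps the logarithm defined when views is 0
    return min(math.log(1 + views), cap)

def year_distance(year: int, target: int) -> int:
    """abs turns a signed difference into a distance."""
    return abs(year - target)

small = popularity_weight(0)
capped = popularity_weight(1_000_000)
uncapped = popularity_weight(1_000_000, cap=20.0)
distance = year_distance(1951, 1960)
```

A million views contribute a weight of roughly 13.8 before capping, so a document with a very large count cannot drown out the relevance score it is multiplied with.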
Collection
All queries must have a collection stage. Currently, we only support topk() and count() collectors.
topk
Use the topk() function to return the top k results. The topk() function accepts the following parameters:
- The logical expression to sort the results by.
- The number of results to return.
- Whether to sort the results in ascending order.
For example, to return the top results ranked by a computed field named title_similarity, pass that field as the sort expression.
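The effect of the three parameters can be modeled with Python's heapq module. This is a sketch of the collection semantics only, not the SDK's topk(): the function signature, the asc argument name and the sample rows are assumptions.

```python
import heapq
from typing import Any, Callable

def topk(rows: list[dict], key: Callable[[dict], Any], k: int, asc: bool = False) -> list[dict]:
    """Return the k rows with the largest key, or the smallest when asc is True."""
    pick = heapq.nsmallest if asc else heapq.nlargest
    return pick(k, rows, key=key)

rows = [
    {"_id": "a", "title_similarity": 0.2},
    {"_id": "b", "title_similarity": 0.9},
    {"_id": "c", "title_similarity": 0.5},
]
best = topk(rows, key=lambda r: r["title_similarity"], k=2)
worst = topk(rows, key=lambda r: r["title_similarity"], k=2, asc=True)
```

Ascending order is the natural choice when the sort expression is a distance, where smaller means closer, and descending when it is a similarity or relevance score.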
count
Use the count() function to get the total number of documents matching the query. If there are no filters, count() will return the total number of documents in the collection.
When writing queries, remember that they all require the topk or count function at the end.
Rerank
The rerank() function is used to rerank the results of a query. Read more about it in our reranking guide.
LSN-based Consistency
TopK supports LSN (Log Sequence Number) based consistency for ensuring read-after-write consistency. When you perform a write operation (like upsert), you receive an LSN as a string that represents the sequence number of that write in the system’s log.
You can use this LSN in subsequent queries to ensure that the query only returns results that are at least as recent as that write operation.
How it works
- Write operation: When you call lsn = client.collection().upsert(), you receive an LSN
- Query with LSN: Pass that LSN to client.collection().query(..., lsn=lsn)
- Consistency guarantee: If the write is not yet available in the read path, the query will be rejected and the client will automatically retry
Using LSN-based consistency may increase query latency as the system needs to verify that the specified LSN has been processed before returning results.