Query documents
TopK provides a data frame-like syntax for querying documents. It features built-in semantic search, text search, vector search, metadata filtering as well as reranking capabilities.
With TopK’s declarative query builder, you can easily select fields, chain filters, and apply vector/text search in a composable manner.
Query structure
In TopK, a query consists of multiple stages:
- Select stage - Select static or computed fields that will be returned in the query results
  - these fields can be used in stages such as Filter, TopK or Rerank
- Filter stage - Filter the documents that will be returned in the query results
  - filters can be applied to static fields, computed fields such as vector_distance() or semantic_similarity(), or custom properties computed inside select()
- TopK stage - Return the top k results based on the provided logical expression
- Count stage - Return the total number of documents matching the query
- Rerank stage - Rerank the results
All queries must have either a TopK or a Count collection stage.
You can stack multiple select, filter and rerank stages in a single query.
A typical query in TopK looks as follows:
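As a rough illustration of how the stages compose, the sketch below models a select, filter and topk pipeline over an in-memory list in plain Python. It is not TopK SDK code: `run_query`, its parameters and the sample books are hypothetical, and the computed field (title length) merely stands in for a real score.

```python
# Plain-Python model of a select -> filter -> topk pipeline.
# None of these helpers come from the TopK SDK; they are hypothetical.

def run_query(docs, select, filters, sort_key, k, asc=False):
    """Apply the stages in order and return the top k projected rows."""
    # Select stage: project static fields and evaluate computed ones.
    rows = [{name: expr(doc) for name, expr in select.items()} for doc in docs]
    # Filter stages: every stacked filter must pass.
    for predicate in filters:
        rows = [row for row in rows if predicate(row)]
    # TopK stage: order by the sort key and keep k rows.
    rows.sort(key=lambda row: row[sort_key], reverse=not asc)
    return rows[:k]


books = [
    {"title": "The Catcher in the Rye", "published_year": 1951},
    {"title": "Harry Potter and the Philosopher's Stone", "published_year": 1997},
    {"title": "The Great Gatsby", "published_year": 1925},
]

results = run_query(
    books,
    select={
        "title": lambda d: d["title"],
        "published_year": lambda d: d["published_year"],
        # Computed field: title length stands in for a real score.
        "title_length": lambda d: len(d["title"]),
    },
    filters=[lambda r: r["published_year"] < 1990],
    sort_key="title_length",
    k=1,
)
print(results)
```

The computed field is defined in the select stage and then reused as the sort key, which mirrors how selected fields feed later stages.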
Select
The select() function is used to initialize the select stage of a query. It accepts key-value pairs of field names and field expressions.
Select expressions
Use the field() function to select fields from a document. In the select stage, you can also rename existing fields or define computed fields using function expressions.
Function expressions
Function expressions are used to define computed fields that will be included in your query results. TopK currently supports three main function expressions:
- bm25_score(): Calculates relevance scores using the BM25 algorithm for keyword search
- vector_distance(field, vector): Computes distance between vectors for vector search
- semantic_similarity(field, query): Measures semantic similarity between the provided text query and the field’s embedding
BM25 Score
The BM25 score is a relevance score that can be used to score documents based on their text content.
To use fn.bm25_score() in your query, you must include a match predicate in your filter stage.
To use the fn.bm25_score() function, you must also have a keyword index defined in your collection schema.
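For intuition about what a BM25 score measures, here is a minimal, self-contained sketch of the Okapi BM25 formula in plain Python. It is not how TopK computes `bm25_score()`: the tokenizer, the `k1` and `b` defaults, and the smoothed IDF variant are common choices assumed here for illustration, and `bm25_scores` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on any non-alphanumeric character.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["The Catcher in the Rye", "The Great Gatsby", "Catch-22"]
scores = bm25_scores("catcher", docs)
print(scores)
```

Only the first title contains the query term, so it is the only document with a non-zero score.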
Vector distance
The vector_distance() function is used to compute the distance between a query vector and a vector field in a collection.
To use the vector_distance() function, you must have a vector index defined on the field you’re computing the vector distance against.
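The text does not say which distance metric is used, so the sketch below shows two common ones, cosine distance and Euclidean distance, in plain Python. The helper names and sample embeddings are hypothetical, and which metric TopK applies to a given vector index is an assumption left open here.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two points."""
    return math.dist(a, b)


# Made-up two-dimensional embeddings, for illustration only.
query_vector = [1.0, 0.0]
embeddings = {
    "doc-1": [1.0, 0.0],
    "doc-2": [0.0, 1.0],
    "doc-3": [1.0, 1.0],
}

distances = {
    doc_id: cosine_distance(query_vector, vector)
    for doc_id, vector in embeddings.items()
}
# A smaller distance means a closer match.
nearest = min(distances, key=distances.get)
print(nearest, distances)
```

With distance-style scores, the best match is the smallest value, so results would be ordered in ascending order.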
Semantic similarity
The semantic_similarity() function is used to compute the similarity between a text query and a text field in a collection.
To use the semantic_similarity() function, you must have a semantic index defined on the field you’re computing the similarity on.
Advanced select expressions
TopK doesn’t only let you select static fields from your documents or computed fields using function expressions. You can also use TopK’s powerful expression language to select fields by chaining arbitrary logical expressions.
Filtering
You can filter documents by metadata, keywords, custom properties computed inside select() (e.g. vector similarity or BM25 score) and more. Filter expressions support all comparison operators (==, !=, >, >=, <, <=), arithmetic operations (+, -, *, /), and the boolean operators | and &.
Metadata filtering
Keyword search
The match() function is the backbone of keyword search in TopK.
It allows you to search for documents that contain specific keywords or phrases.
You can configure the match() function to:
- Match on multiple terms
- Match only on specific fields
- Use weights to prioritize certain terms
The match() function accepts the following parameters:
- String token to match. It can also contain multiple terms separated by a delimiter, which is any non-alphanumeric character.
- Field to match on. If not provided, the function will match on all fields.
- Weight to use for matching. If not provided, the function will use the default weight (1.0).
- Use the all parameter when a text must contain all terms (separated by a delimiter):
  - when all is false (default), it is equivalent to the OR operator
  - when all is true, it is equivalent to the AND operator
Searching for a term like "catcher" in your documents is as simple as using the match() function in the filter stage of your query.
Match multiple terms
The match() function can be configured to match all terms when using a delimiter.
A term delimiter is any non-alphanumeric character.
To ensure that all terms are matched, use the all parameter.
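The delimiter rule and the OR/AND behaviour of the all parameter can be modelled in a few lines of plain Python. This is a sketch of the semantics described above, not the TopK implementation: `matches` is a hypothetical helper, the flag is named `require_all` to avoid shadowing Python's built-in `all`, and case-insensitive matching is an assumption.

```python
import re


def tokenize(text):
    # Any non-alphanumeric character acts as a term delimiter.
    # Assumption: matching is case-insensitive and ASCII-only.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def matches(query, text, require_all=False):
    """Return True if the text matches the query terms.

    require_all=False behaves like OR: any term is enough.
    require_all=True behaves like AND: every term must be present.
    """
    terms = tokenize(query)
    words = set(tokenize(text))
    hits = [term in words for term in terms]
    return all(hits) if require_all else any(hits)


title = "The Catcher in the Rye"
print(matches("catcher|gatsby", title))                    # OR semantics
print(matches("catcher|gatsby", title, require_all=True))  # AND semantics
print(matches("catcher rye", title, require_all=True))     # both terms present
```

Because `|` and the space are both non-alphanumeric, they split the query into separate terms in exactly the same way.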
Give weight to specific terms
You can give weight to specific terms by using the weight parameter.
Combine keyword search and metadata filtering
You can combine metadata filtering and keyword search in a single query by stacking multiple filter stages.
For example, you can search for documents that contain the keyword "catcher" and were published in 1997, or documents that were published between 1920 and 1980.
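The boolean shape of that condition is (keyword AND year) OR (year range). The plain-Python sketch below spells it out over a small in-memory list. It is not TopK SDK code: `contains_term` and `keep` are hypothetical helpers, the range is assumed to be inclusive on both ends, and the last sample book is made up so that the keyword branch has something to match.

```python
import re


def contains_term(text, term):
    # Assumption: case-insensitive match on whole terms.
    tokens = re.split(r"[^0-9a-z]+", text.lower())
    return term.lower() in tokens


def keep(doc):
    """(keyword AND published in 1997) OR (published 1920 to 1980)."""
    keyword_and_year = (
        contains_term(doc["title"], "catcher") and doc["published_year"] == 1997
    )
    # Assumption: "between" includes both endpoints.
    in_range = 1920 <= doc["published_year"] <= 1980
    return keyword_and_year or in_range


books = [
    {"title": "The Catcher in the Rye", "published_year": 1951},
    {"title": "The Great Gatsby", "published_year": 1925},
    {"title": "Harry Potter and the Philosopher's Stone", "published_year": 1997},
    {"title": "Catcher: An Anthology", "published_year": 1997},  # made-up entry
]

kept = [book["title"] for book in books if keep(book)]
print(kept)
```

The first two books pass through the year range alone, while the made-up 1997 title passes only because it also contains the keyword.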
Operators
When writing queries, you can use the following operators for field selection or filtering:
Logical operators
and
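The and operator can be used to combine multiple logical expressions. The result is true only when all of them are true.
or
The or operator can be used to combine multiple logical expressions. The result is true when at least one of them is true.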
Comparison operators
eq
The eq operator can be used to match documents that have a field with a specific value.
ne
The ne operator can be used to match documents that have a field with a value that is not equal to a specific value.
gt
The gt operator can be used to match documents that have a field with a value greater than a specific value.
gte
The gte operator can be used to match documents that have a field with a value greater than or equal to a specific value.
lt
The lt operator can be used to match documents that have a field with a value less than a specific value.
lte
The lte operator can be used to match documents that have a field with a value less than or equal to a specific value.
starts_with
The starts_with operator can be used on string fields to match documents whose field value starts with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.
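The tenant-scoping idea can be sketched with Python's built-in `str.startswith`. This is a plain-Python model rather than TopK SDK code; `scope_to_tenant` and the sample IDs are hypothetical.

```python
def scope_to_tenant(doc_ids, tenant_id):
    """Keep only IDs of the form {tenant_id}/{document_id}."""
    # Include the "/" separator in the prefix so that tenant "acme"
    # does not also match IDs that belong to tenant "acme-corp".
    prefix = f"{tenant_id}/"
    return [doc_id for doc_id in doc_ids if doc_id.startswith(prefix)]


ids = ["acme/1", "acme/2", "acme-corp/1", "globex/1"]
acme_ids = scope_to_tenant(ids, "acme")
print(acme_ids)
```

Including the separator in the prefix matters whenever one tenant ID is itself a prefix of another.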
contains
The contains operator can be used on string fields to match documents that include a specific substring. It is case-sensitive and is particularly useful in scenarios where you need to filter results based on a portion of a string.
Arithmetic operators
add
The add operator can be used to add two numbers.
sub
The sub operator can be used to subtract two numbers.
mul
The mul operator can be used to multiply two numbers.
div
The div operator can be used to divide two numbers.
Unary operators
not
The not helper can be used to negate a logical expression. It takes an expression as an argument and inverts its logic.
is_null
The is_null operator can be used to match documents that have a field with a value that is null.
is_not_null
The is_not_null operator can be used to match documents that have a field with a value that is not null.
Collection
All queries must have a collection stage. Currently, we only support topk() and count() collectors.
topk
Use the topk() function to return the top k results. The topk() function accepts the following parameters:
- The logical expression to sort the results by.
- The number of results to return.
- Whether to sort the results in ascending order.
To get the top 10 results ordered by the title_similarity field, pass that field as the sort expression and 10 as the number of results.
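The three parameters map naturally onto a heap-based selection. The sketch below models them with Python's standard `heapq` module; `top_k_rows` is a hypothetical helper, not the TopK collector, the default of descending order is an assumption, and the similarity values are made up.

```python
import heapq


def top_k_rows(rows, sort_field, k, asc=False):
    """Return k rows ordered by sort_field.

    asc=False keeps the largest values first; asc=True keeps the smallest.
    """
    pick = heapq.nsmallest if asc else heapq.nlargest
    return pick(k, rows, key=lambda row: row[sort_field])


rows = [
    {"title": "A", "title_similarity": 0.2},
    {"title": "B", "title_similarity": 0.9},
    {"title": "C", "title_similarity": 0.5},
]

best = top_k_rows(rows, "title_similarity", 2)
lowest = top_k_rows(rows, "title_similarity", 2, asc=True)
print([row["title"] for row in best])
print([row["title"] for row in lowest])
```

Descending order suits similarity-style scores where larger is better, while ascending order suits distance-style scores where smaller is better.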
count
Use the count() function to get the total number of documents matching the query. If there are no filters, count() will return the total number of documents in the collection.
When writing queries, remember that they all require the topk or count function at the end.
Rerank
The rerank() function is used to rerank the results of a query. Read more about it in our reranking guide.