TopK provides a data frame-like syntax for querying documents. It features built-in semantic search, text search, vector search, metadata filtering, and reranking.
With TopK’s declarative query builder, you can easily select fields, chain filters, and apply vector/text search in a composable manner.
In TopK, a query consists of multiple stages:
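- Select stage: selects the fields to return, including computed fields such as vector_distance() or semantic_similarity()
- Filter stage: filters documents by metadata, keywords, or custom properties computed inside select()
- TopK stage: returns the top k results based on the provided logical expression
All queries must have either a TopK or a Count collection stage.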
You can stack multiple select and filter stages in a single query.
A typical query in TopK combines a select stage, a filter stage, and a collection stage.
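The stage model can be illustrated without the SDK. The following is a minimal plain-Python sketch that mimics select, filter, and topk stages over an in-memory list of documents. The helper names `select_stage`, `filter_stage`, and `topk_stage` are hypothetical and are not part of TopK's API; they only show how each stage feeds the next.

```python
# Plain-Python illustration of the stage model; this is NOT the TopK SDK.
books = [
    {"title": "The Catcher in the Rye", "published_year": 1951},
    {"title": "The Great Gatsby", "published_year": 1925},
    {"title": "Harry Potter and the Philosopher's Stone", "published_year": 1997},
]


def select_stage(docs, **computed):
    """Add computed fields; each keyword maps a new field name to a function of the document."""
    return [{**doc, **{name: fn(doc) for name, fn in computed.items()}} for doc in docs]


def filter_stage(docs, predicate):
    """Keep only the documents for which the predicate holds."""
    return [doc for doc in docs if predicate(doc)]


def topk_stage(docs, key, k, asc=False):
    """Sort by a (possibly computed) field and keep the first k documents."""
    return sorted(docs, key=lambda doc: doc[key], reverse=not asc)[:k]


# Stages are applied in order: select, then filter, then the collection stage.
selected = select_stage(books, title_length=lambda doc: len(doc["title"]))
filtered = filter_stage(selected, lambda doc: doc["published_year"] > 1930)
top = topk_stage(filtered, key="title_length", k=1)
```

Here the computed field `title_length` plays the role that a score such as a vector distance would play in a real query: it is defined in the select stage and then used for ordering in the collection stage.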
The select() function is used to initialize the select stage of a query. It accepts key-value pairs of field names and field expressions.
Use the field() function to select fields from a document. In the select stage, you can also rename existing fields or define computed fields using function expressions.
Function expressions are used to define computed fields that will be included in your query results. TopK currently supports three main function expressions:
- vector_distance(field, vector): Computes distance between vectors for vector search. This function is available for all dense and sparse vector types.
- bm25_score(): Calculates relevance scores using the BM25 algorithm for keyword search.
- semantic_similarity(field, query): Measures semantic similarity between the provided text query and the field’s embedding.
The vector_distance() function is used to compute the vector score between a query vector and a vector field in a collection.
There are multiple ways to represent a query vector:
Dense vectors:
- [0.1, 0.2, 0.3, ...] - Array of numbers resolved as a dense float32 vector
- f32_vector([...]) - Helper function returning a dense float32 vector
- u8_vector([...]) - Helper function returning a dense u8 vector
- binary_vector([...]) - Helper function returning a binary vector
Sparse vectors:
- { 0: 0.1, 1: 0.2, 2: 0.3, ... } - Mapping from index → value resolved as a sparse float32 vector
- f32_sparse_vector({ ... }) - Helper function returning a sparse float32 vector
- u8_sparse_vector({ ... }) - Helper function returning a sparse u8 vector
See the Helper functions page for details on how to use vector helper functions.
To use the vector_distance() function, you must have a vector index defined on the field you’re computing the vector distance against.
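To make the dense and sparse representations concrete, here is a small plain-Python sketch of two common vector scores. This text does not say which metric vector_distance() applies, so cosine similarity (dense) and dot product (sparse) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity_dense(a, b):
    """Cosine similarity between two dense vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dot_product_sparse(a, b):
    """Dot product of two sparse vectors given as {index: value} mappings."""
    # Iterate over the smaller mapping; only indices present in both contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[index] for index, value in a.items() if index in b)


query = [1.0, 0.0, 0.0]
same_direction = cosine_similarity_dense(query, [2.0, 0.0, 0.0])
orthogonal = cosine_similarity_dense(query, [0.0, 3.0, 0.0])

# Only index 7 is shared, so the result is 2.0 * 3.0.
sparse_score = dot_product_sparse({0: 0.5, 7: 2.0}, {7: 3.0, 9: 1.0})
```

The sparse form stores only non-zero entries, which is why the mapping from index to value is enough to compute the score.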
The BM25 score is a relevance score that can be used to score documents based on their text content.
To use fn.bm25_score() in your query, you must include a match predicate in your filter stage.
To use the fn.bm25_score() function, you must have a keyword index defined in your collection schema.
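For readers unfamiliar with BM25, the sketch below computes the textbook Okapi BM25 score in plain Python. It is not TopK's implementation: the tokenizer, the parameter values `k1=1.2` and `b=0.75`, and the IDF variant are assumptions, and TopK's actual scoring details are not described here.

```python
import math
import re


def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric (ASCII) characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query string."""
    corpus = [tokenize(doc) for doc in documents]
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus) / n_docs
    query_terms = set(tokenize(query))
    # Document frequency: in how many documents each query term appears.
    doc_freq = {t: sum(1 for tokens in corpus if t in tokens) for t in query_terms}

    scores = []
    for tokens in corpus:
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


scores = bm25_scores(
    "catcher",
    ["The Catcher in the Rye", "The Great Gatsby", "Catcher catcher mitt"],
)
```

A document that does not contain any query term scores zero, which is consistent with the requirement that bm25_score() be paired with a match predicate: the predicate supplies the terms being scored.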
The semantic_similarity() function is used to compute the similarity between a text query and a text field in a collection.
To use the semantic_similarity() function, you must have a semantic index defined on the field you’re computing the similarity on.
TopK doesn’t only let you select static fields from your documents or computed fields using function expressions. You can also use TopK’s powerful expression language to select fields by chaining arbitrary logical expressions.
You can filter documents by metadata, keywords, custom properties computed inside select() (e.g. vector similarity or BM25 score) and more. Filter expressions support all comparison operators: ==, !=, >, >=, <, <=, arithmetic operations: +, -, *, /, and boolean operators: | and &.
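Expression builders of this kind are commonly implemented with operator overloading. The sketch below is a hypothetical, self-contained `Expr` class in plain Python, not TopK's implementation; it shows how comparison, arithmetic, and boolean operators can build a deferred expression that is later evaluated per document. Python does not allow overloading the `and` and `or` keywords, which is a likely reason such APIs use `&` and `|` instead.

```python
import operator


class Expr:
    """A deferred expression that is evaluated against a document (a dict)."""

    def __init__(self, evaluate):
        self.evaluate = evaluate

    def _combine(self, other, op):
        # Wrap plain Python values so both sides can be evaluated per document.
        rhs = other if isinstance(other, Expr) else Expr(lambda doc: other)
        return Expr(lambda doc: op(self.evaluate(doc), rhs.evaluate(doc)))


# Attach one special method per supported operator.
_OPERATORS = {
    "__eq__": operator.eq, "__ne__": operator.ne,
    "__gt__": operator.gt, "__ge__": operator.ge,
    "__lt__": operator.lt, "__le__": operator.le,
    "__add__": operator.add, "__sub__": operator.sub,
    "__mul__": operator.mul, "__truediv__": operator.truediv,
    "__and__": operator.and_, "__or__": operator.or_,
}
for _name, _op in _OPERATORS.items():
    # op=_op binds the current operator; a bare closure would see only the last one.
    setattr(Expr, _name, lambda self, other, op=_op: self._combine(other, op))


def field(name):
    """Hypothetical stand-in for a field reference: looks up a key in the document."""
    return Expr(lambda doc: doc[name])


# & and | bind more tightly than comparisons in Python, so each comparison
# needs its own parentheses.
condition = (
    (field("published_year") >= 1920) & (field("published_year") <= 1980)
) | (field("rating") * 2 > 9)

in_range = condition.evaluate({"published_year": 1951, "rating": 3.0})
high_rating = condition.evaluate({"published_year": 1997, "rating": 4.8})
neither = condition.evaluate({"published_year": 1997, "rating": 4.0})
```

Forgetting the parentheses around a comparison is the most common mistake with this style: `a > 1 & b < 2` is parsed as `a > (1 & b) < 2`.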
The match() function is the backbone of keyword search in TopK. It allows you to search for documents that contain specific keywords or phrases.
You can configure the match() function to match multiple terms, restrict matching to a specific field, and weight matches.
The match() function accepts the following parameters:
- String token to match. It can also contain multiple terms separated by a delimiter, which is any non-alphanumeric character.
- Field to match on. If not provided, the function will match on all fields.
- Weight to use for matching. If not provided, the function will use the default weight (1.0).
- The all parameter, used when a text must contain all terms (separated by a delimiter). When all is false (default), it is equivalent to the OR operator; when all is true, it is equivalent to the AND operator.
Searching for a term like "catcher" in your documents is as simple as using the match() function in the filter stage of your query.
The match() function can be configured to match all terms when using a delimiter. A term delimiter is any non-alphanumeric character. To ensure that all terms are matched, use the all parameter.
You can give weight to specific terms by using the weight parameter.
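The OR versus AND behaviour of the all parameter can be sketched in plain Python. This is an illustration of the semantics described above, not TopK's matcher: the function names are hypothetical, `require_all` stands in for the all parameter, and lowercasing plus ASCII-only splitting are simplifying assumptions. The weight parameter is not modelled.

```python
import re


def split_terms(text):
    """Split on any run of non-alphanumeric (ASCII) characters and lowercase."""
    return [term for term in re.split(r"[^0-9A-Za-z]+", text.lower()) if term]


def matches(token, text, require_all=False):
    """Return True if the text contains any (default) or all of the terms in token."""
    query_terms = set(split_terms(token))
    text_terms = set(split_terms(text))
    if require_all:
        # AND: every query term must be present in the text.
        return query_terms <= text_terms
    # OR: at least one query term must be present in the text.
    return bool(query_terms & text_terms)


title = "The Catcher in the Rye"
any_term = matches("catcher gatsby", title)
all_terms_missing = matches("catcher gatsby", title, require_all=True)
all_terms_present = matches("catcher|rye", title, require_all=True)
```

Because any non-alphanumeric character acts as a delimiter, `"catcher gatsby"`, `"catcher|gatsby"`, and `"catcher-gatsby"` all split into the same two terms in this sketch.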
You can combine metadata filtering and keyword search in a single query by stacking multiple filter stages. For example, you can search for documents that contain the keyword "catcher" and were published in 1997, or documents that were published between 1920 and 1980.
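The boolean structure of that condition is easy to get wrong, so here it is spelled out as a plain-Python predicate over made-up sample documents. The helper `apply_filters` is hypothetical, and it assumes that each stacked filter further narrows the previous result (that is, stacked filters behave like AND).

```python
# Made-up sample documents, for illustration only.
docs = [
    {"id": "a", "text": "catcher notes", "published_year": 1997},
    {"id": "b", "text": "pitcher notes", "published_year": 1997},
    {"id": "c", "text": "pitcher notes", "published_year": 1951},
    {"id": "d", "text": "catcher notes", "published_year": 2005},
]


def apply_filters(documents, *predicates):
    """Apply predicates one after another, narrowing the result each time."""
    for predicate in predicates:
        documents = [doc for doc in documents if predicate(doc)]
    return documents


def keyword_and_year_or_range(doc):
    # (keyword AND year == 1997) OR (1920 <= year <= 1980)
    has_keyword = "catcher" in doc["text"].split()
    return (has_keyword and doc["published_year"] == 1997) or (
        1920 <= doc["published_year"] <= 1980
    )


single = [doc["id"] for doc in apply_filters(docs, keyword_and_year_or_range)]
stacked = [
    doc["id"]
    for doc in apply_filters(docs, keyword_and_year_or_range, lambda doc: doc["id"] != "c")
]
```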
When writing queries, you can use the following operators for field selection or filtering:
- The and operator combines multiple logical expressions; a document matches only if all of them hold.
- The or operator combines multiple logical expressions; a document matches if at least one of them holds.
- The eq operator matches documents that have a field with a specific value.
- The ne operator matches documents that have a field with a value that is not equal to a specific value.
- The gt operator matches documents that have a field with a value greater than a specific value.
- The gte operator matches documents that have a field with a value greater than or equal to a specific value.
- The lt operator matches documents that have a field with a value less than a specific value.
- The lte operator matches documents that have a field with a value less than or equal to a specific value.
- The starts_with operator can be used on string fields to match documents that start with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.
- The contains operator can be used on string fields to match documents that include a specific substring. It is case-sensitive and is particularly useful when you need to filter results based on a portion of a string.
- The add operator adds two numbers.
- The sub operator subtracts two numbers.
- The mul operator multiplies two numbers.
- The div operator divides two numbers.
- The not helper negates a logical expression. It takes an expression as an argument and inverts its logic.
- The is_null operator matches documents that have a field whose value is null.
- The is_not_null operator matches documents that have a field whose value is not null.
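The string and null operators map onto familiar Python operations. The sketch below uses plain Python over made-up documents to show the behaviour described above for starts_with (tenant scoping), contains (case-sensitive substring), and is_null; it does not call the TopK SDK, and the field names are hypothetical.

```python
# Made-up documents whose IDs follow the {tenant_id}/{document_id} scheme.
docs = [
    {"id": "acme/1", "title": "Quarterly Report", "summary": None},
    {"id": "acme/2", "title": "quarterly notes", "summary": "draft"},
    {"id": "acme-labs/1", "title": "Lab Report", "summary": "final"},
]

# starts_with: scope to one tenant. Including the trailing "/" in the prefix
# keeps "acme-labs/..." from being mistaken for tenant "acme".
tenant_ids = [d["id"] for d in docs if d["id"].startswith("acme/")]

# contains: substring match that is case-sensitive.
with_report = [d["id"] for d in docs if "Report" in d["title"]]
lowercase_report = [d["id"] for d in docs if "report" in d["title"]]

# is_null: the field has no value.
missing_summary = [d["id"] for d in docs if d["summary"] is None]
```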
All queries must have a collection stage. Currently, we only support the topk() and count() collectors.
Use the topk() function to return the top k results. The topk() function accepts the following parameters:
- The logical expression to sort the results by.
- The number of results to return.
- Whether to sort the results in ascending order.
For example, you can use topk() to get the top 10 results ordered by the title_similarity field.
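The effect of the three parameters can be sketched with Python's standard `heapq` module. This is a plain-Python illustration, not the TopK SDK; the function name `top_k`, the sample scores, and the descending default (`asc=False`) are assumptions.

```python
import heapq

# Made-up results with a precomputed similarity score.
results = [
    {"title": "a", "title_similarity": 0.42},
    {"title": "b", "title_similarity": 0.91},
    {"title": "c", "title_similarity": 0.77},
    {"title": "d", "title_similarity": 0.15},
]


def top_k(docs, sort_key, k, asc=False):
    """Return the k documents with the highest (or, if asc, lowest) sort_key value."""
    pick = heapq.nsmallest if asc else heapq.nlargest
    return pick(k, docs, key=lambda doc: doc[sort_key])


best = [doc["title"] for doc in top_k(results, "title_similarity", 2)]
lowest = [doc["title"] for doc in top_k(results, "title_similarity", 2, asc=True)]
```

Ascending order is the natural choice when the sort expression is a distance (smaller is closer), while descending order suits similarity or relevance scores.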
Use the count() function to get the total number of documents matching the query. If there are no filters, count() returns the total number of documents in the collection.
When writing queries, remember that they all require the topk or count function at the end.
The rerank() function is used to rerank the results of a query. Read more about it in our reranking guide.