TopK provides a data frame-like syntax for querying documents. It features built-in semantic search, text search, vector search, and metadata filtering, as well as reranking capabilities. With TopK’s declarative query builder, you can easily select fields, chain filters, and apply vector or text search in a composable manner.

Query structure

In TopK, a query consists of multiple stages:
  • Select stage - Select static or computed fields that will be returned in the query results
    • these fields can be used in later stages such as Filter, TopK, or Rerank
  • Filter stage - Filter the documents that will be returned in the query results
    • filters can be applied to static fields, to computed fields such as vector_distance() or semantic_similarity(), or to custom properties computed inside select()
  • TopK stage - Return the top k results based on the provided logical expression
  • Count stage - Return the total number of documents matching the query
  • Rerank stage - Rerank the results
All queries must include either a TopK or a Count collection stage.
You can stack multiple select and filter stages in a single query.
A typical query in TopK combines several of these stages, as shown in the examples below.
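For example, a query that selects a computed score, applies keyword and metadata filters, and collects the top results might look like the following sketch (the books collection and its fields are illustrative):

```python
from topk_sdk.query import select, field, fn, match

docs = client.collection("books").query(
    select(
        "title",
        text_score=fn.bm25_score(),          # computed field
    )
    .filter(match("catcher"))                # keyword filter
    .filter(field("published_year") > 1980)  # metadata filter
    .topk(field("text_score"), 10)           # collection stage
)
```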

Select

The select() function initializes the select stage of a query. It accepts plain field names as well as key-value pairs of output names and field expressions:
from topk_sdk.query import select, field

client.collection("books").query(
  select(
    "published_year", # elect the static fields directly
    title=field("title"),
  )
  ...
)

Select expressions

Use the field() function to select fields from a document. In the select stage, you can also rename existing fields or define computed fields using function expressions.
from topk_sdk.query import select, field

docs = client.collection("books").query(
    select(
        "title", # the actual "title" field from the document
        year=field("published_year"), # renamed field
        year_plus_ten=field("published_year") + 10, # computed field
    )
)

Function expressions

Function expressions are used to define computed fields that will be included in your query results. TopK currently supports three main function expressions:
  • vector_distance(field, vector): Computes distance between vectors for vector search. This function is available for all dense and sparse vector types.
  • bm25_score(): Calculates relevance scores using the BM25 algorithm for keyword search.
  • semantic_similarity(field, query): Measures semantic similarity between the provided text query and the field’s embedding.

Vector distance

The vector_distance() function is used to compute the distance between a query vector and a vector field in a collection. There are multiple ways to represent a query vector:
  • Dense vectors:
    • [0.1, 0.2, 0.3, ...] - Array of numbers resolved as a dense float32 vector
    • f32_vector([...]) - Helper function returning a dense float32 vector
    • u8_vector([...]) - Helper function returning a dense u8 vector
    • binary_vector([...]) - Helper function returning a binary vector
  • Sparse vectors:
    • { 0: 0.1, 1: 0.2, 2: 0.3, ... } - Mapping from index → value resolved as a sparse float32 vector
    • f32_sparse_vector({ ... }) - Helper function returning a sparse float32 vector
    • u8_sparse_vector({ ... }) - Helper function returning a sparse u8 vector
See the Helper functions page for details on how to use vector helper functions.
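To build intuition for the two representations, here is a plain-Python sketch (illustrative only, not TopK’s implementation) showing how a dense vector and a sparse index-to-value mapping can both be scored against a document vector with a dot product:

```python
# Illustrative sketch (not TopK's implementation): a dense vector and a
# sparse {index: value} mapping can both be scored against a document
# vector via a dot product; only shared indices contribute for sparse.
def dense_dot(query, doc):
    return sum(q * d for q, d in zip(query, doc))

def sparse_dot(query, doc):
    return sum(v * doc[i] for i, v in query.items() if i in doc)

doc_dense = [0.5, 0.0, 0.5]
doc_sparse = {0: 0.5, 2: 0.5}

dense_score = dense_dot([0.1, 0.2, 0.3], doc_dense)              # ~0.2
sparse_score = sparse_dot({0: 0.1, 1: 0.2, 2: 0.3}, doc_sparse)  # ~0.2
```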
Optionally, users can provide skip_refine=True to bypass the internal distance refinement step. This improves performance for queries with a large top_k at the cost of lower accuracy.
We don’t recommend using skip_refine=True unless you’re using a large top_k and a custom reranking model to compute the final ranking.
To use the vector_distance() function, you must have a vector index defined on the field you’re computing the vector distance against:
from topk_sdk.query import select, field, fn

docs = client.collection("books").query(
    select(
        "title",
        title_similarity=fn.vector_distance(
          "title_embedding",
          [0.1, 0.2, 0.3, ...] # embedding for "animal"
        )
    )
    .topk(field("title_similarity"), 10)
)

# Example result:
[
  {
    "_id": "2",
    "title": "To Kill a Mockingbird",
    "title_similarity": 0.7484796643257141
  },
  {
    "_id": "1",
    "title": "The Catcher in the Rye",
    "title_similarity": 0.5471329569816589
  }
]

BM25 Score

The BM25 score is a relevance score that can be used to rank documents based on their text content. To use fn.bm25_score() in your query, you must include a match() predicate in your filter stage.
To use the fn.bm25_score() function, you must also have a keyword index defined in your collection schema.
from topk_sdk.query import select, field, fn, match

docs = client.collection("books").query(
    select(
        "title",
        text_score=fn.bm25_score(),
    )
    .filter(match("Good")) # must include a match predicate
    .topk(field("text_score"), 10)
)

# Example result:
[
  {
    "_id": "1",
    "title": "Good Night, Bat! Good Morning, Squirrel!",
    "text_score": 0.2447269707918167
  },
  {
    "_id": "2",
    "title": "Good Girl, Bad Blood",
    "text_score": 0.20035339891910553
  }
]
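TopK computes the BM25 score server-side; for intuition, here is a minimal single-term BM25 sketch in plain Python (the standard formula with common defaults k1=1.2, b=0.75; TopK’s exact parameters and tokenization may differ):

```python
import math

# Minimal single-term BM25 sketch (standard formula; TopK's exact
# parameters and tokenization may differ).
def bm25_score(term, doc_tokens, corpus, k1=1.2, b=0.75):
    n_docs = len(corpus)
    n_containing = sum(1 for d in corpus if term in d)
    # Inverse document frequency: rarer terms score higher
    idf = math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
    tf = doc_tokens.count(term)
    avgdl = sum(len(d) for d in corpus) / n_docs
    # Length normalization: long documents are penalized
    norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)
    return idf * tf * (k1 + 1) / (tf + norm)

corpus = [
    ["good", "night", "bat", "good", "morning", "squirrel"],
    ["good", "girl", "bad", "blood"],
    ["the", "catcher", "in", "the", "rye"],
]
# The first title repeats "good", so it outscores the second;
# the third contains no match and scores 0.
scores = [bm25_score("good", doc, corpus) for doc in corpus]
```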

Semantic similarity

The semantic_similarity() function is used to compute the similarity between a text query and a text field in a collection. To use the semantic_similarity() function, you must have a semantic index defined on the field you’re computing the similarity on.
from topk_sdk.query import select, field, fn

docs = client.collection("books").query(
    select(
        "title",
        title_similarity=fn.semantic_similarity("title", "animal"),
    )
    .topk(field("title_similarity"), 10)
)

# Example result:
[
  {
    "_id": "2",
    "title": "To Kill a Mockingbird",
    "title_similarity": 0.7484796643257141
  },
  {
    "_id": "1",
    "title": "The Catcher in the Rye",
    "title_similarity": 0.5471329569816589
  }
]

Advanced select expressions

TopK doesn’t only let you select static fields from your documents or computed fields using function expressions. You can also use TopK’s powerful expression language to define fields by chaining arbitrary logical and arithmetic expressions:
from topk_sdk.query import select, field

select(
  weight_in_grams=field("weight").mul(1000),
  is_adult=field("age").gt(18),
  published_in_nineteenth_century=(field("published_year") >= 1800) & (field("published_year") < 1900),
)

Filtering

You can filter documents by metadata, keywords, custom properties computed inside select() (e.g. vector similarity or BM25 score), and more. Filter expressions support all of the operators described in the Operators section below.

Metadata filtering

.filter(
    field("published_year") > 1980
)

Keyword filtering

The match() function is the backbone of keyword search in TopK. It allows you to search for documents that contain specific keywords or phrases. You can configure the match() function to:
  • Match on multiple terms
  • Match only on specific fields
  • Use weights to prioritize certain terms
The match() function accepts the following parameters:
  • token (string, required) - String token to match. It can also contain multiple terms separated by a delimiter, which is any non-alphanumeric character.
  • field (string, optional) - Field to match on. If not provided, the function will match on all fields.
  • weight (number, optional) - Weight to use for matching. If not provided, the function will use the default weight (1.0).
  • all (boolean, optional) - Use the all parameter when the text must contain all terms (separated by a delimiter):
    • when all is False (default), it’s equivalent to the OR operator
    • when all is True, it’s equivalent to the AND operator
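The term-splitting and all semantics above can be sketched in plain Python (illustrative only, not TopK’s implementation):

```python
import re

# Illustrative sketch of match() term splitting and the `all` flag
# (not TopK's implementation). Terms are separated by any
# non-alphanumeric delimiter.
def split_terms(text):
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def matches(token, text, all_terms=False):
    terms = split_terms(token)
    words = set(split_terms(text))
    hits = [t in words for t in terms]
    return all(hits) if all_terms else any(hits)

split_terms("catcher|rye")                                     # ['catcher', 'rye']
matches("catcher|rye", "The Catcher in the Rye", all_terms=True)  # True (AND)
matches("catcher|rye", "Catcher only", all_terms=True)            # False (AND)
matches("catcher|rye", "Catcher only")                            # True (OR)
```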
Searching for a term like "catcher" in your documents is as simple as using the match() function in the filter stage of your query:
from topk_sdk.query import match

.filter(
    match("catcher")
)

Match multiple terms

The match() function can be configured to match all terms when using a delimiter. A term delimiter is any non-alphanumeric character. To ensure that all terms are matched, use the all parameter:
from topk_sdk.query import match

.filter(match("catcher|rye", field="title", all=True))

Give weight to specific terms

You can give weight to specific terms by using the weight parameter:
from topk_sdk.query import match

.filter(match("catcher", weight=2.0).or(match("rye", weight=1.0)))

Combine keyword search and metadata filtering

You can combine metadata filtering and keyword search in a single query by stacking multiple filter stages. In the example below, we’re searching for documents that contain the keyword "catcher" and were published in 1997, or documents that were published between 1920 and 1980.
.filter(
    match("catcher")
)
.filter(
    field("published_year") == 1997 || (field("published_year") >= 1920 && field("published_year") <= 1980)
)

Operators

When writing queries, you can use the following operators for:
  • field selection
  • filtering
  • topk collection

Logical operators

Logical operators combine multiple expressions by applying boolean logic and conditions.

and

The and operator can be used to combine multiple logical expressions.
.filter(
    field("published_year") == 1997 && field("title") == "The Catcher in the Rye"
)

# or

.filter(
    field("published_year").eq(1997).and_(field("title").eq("The Catcher in the Rye"))
)

or

The or operator can be used to combine multiple logical expressions.
.filter(
    field("published_year") == 1997 || field("title") == "The Catcher in the Rye"
)

# or

.filter(
    field("published_year").eq(1997).or(field("title").eq("The Catcher in the Rye"))
)

not

The not helper can be used to negate a logical expression. It takes an expression as an argument and inverts its logic.
from topk_sdk.query import field, not_

.filter(
    not_(field("title").contains("Catcher"))
)

choose

The choose operator evaluates a condition and returns the first argument if the condition is true, else the second argument.
select(
  summary=(field("book_type") == "fiction").choose(
      field("plot_summary"),
      field("technical_summary")
  )
)

boost

The boost operator multiplies the scoring expression by the provided boost value if the condition is true. Otherwise, the scoring expression is unchanged (multiplied by 1).
select(
  summary_distance=fn.vector_distance("summary_embedding", [2.3] * 16)
).topk(
  field("summary_distance").boost(field("summary").match_all("deep learning"), 1.5),
  10,
  False
)
# this boost expression is equivalent to
# field("summary_distance") * (field("summary").match_all("deep learning").choose(1.5, 1.0)),

coalesce

The coalesce operator replaces null values with a provided value.
select(importance=field("nullable_importance").coalesce(1.0))

Comparison operators

Comparison operators provide various logical, numerical and string functions that evaluate to true or false.

eq

The eq operator can be used to match documents that have a field with a specific value.
.filter(
    field("published_year") == 1997
)

# or

.filter(
    field("published_year").eq(1997)
)

ne

The ne operator can be used to match documents that have a field with a value that is not equal to a specific value.
.filter(
    field("published_year") != 1997
)

# or

.filter(
    field("published_year").ne(1997)
)

is_null

The is_null operator can be used to match documents that have a field with a value that is null.
.filter(
    field("title").is_null()
)

is_not_null

The is_not_null operator can be used to match documents that have a field with a value that is not null.
.filter(
    field("title").is_not_null()
)

gt

The gt operator can be used to match documents that have a field with a value greater than a specific value.
.filter(
    field("published_year") > 1997
)

# or

.filter(
    field("published_year").gt(1997)
)

gte

The gte operator can be used to match documents that have a field with a value greater than or equal to a specific value.
.filter(
    field("published_year") >= 1997
)

# or

.filter(
    field("published_year").gte(1997)
)

lt

The lt operator can be used to match documents that have a field with a value less than a specific value.
.filter(
    field("published_year") < 1997
)

# or

.filter(
    field("published_year").lt(1997)
)

lte

The lte operator can be used to match documents that have a field with a value less than or equal to a specific value.
.filter(
    field("published_year") <= 1997
)

# or

.filter(
    field("published_year").lte(1997)
)

starts_with

The starts_with operator can be used on string fields to match documents that start with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.
.filter(
    field("_id").starts_with("tenant_123/")
)

contains

The contains operator can be used on string fields to match documents that include a specific substring. It is case-sensitive and is particularly useful in scenarios where you need to filter results based on a portion of a string.
.filter(
    field("title").contains("Catcher")
)

match_all

The match_all operator returns true if all terms in the query are present in the field with a keyword index.
.filter(
  field("summary").match_all("love marriage england")
)
# you can also pass a list of strings:
.filter(
  field("summary").match_all(["love", "marriage", "england"])
)
When using a match_all operator against a text field, it must be used in conjunction with a keyword index defined in your collection schema.

match_any

The match_any operator returns true if any term in the query is present in the field with a keyword index.
.filter(
    field("summary").match_any("love ring")
)
# you can also pass a list of strings:
.filter(
  field("summary").match_any(["love", "ring"])
)
When using a match_any operator against a text field, it must be used in conjunction with a keyword index defined in your collection schema.

Mathematical operators

Mathematical operators perform computations on numbers.

add

The add operator can be used to add two numbers.
.filter(
    field("published_year") + 1997
)

# or

.filter(
    field("published_year").add(1997)
)

sub

The sub operator can be used to subtract two numbers.
.filter(
    field("published_year") - 1997
)

# or

.filter(
    field("published_year").sub(1997)
)

mul

The mul operator can be used to multiply two numbers.
.filter(
    field("published_year") * 1997
)

# or

.filter(
    field("published_year").mul(1997)
)

div

The div operator can be used to divide two numbers.
.filter(
    field("published_year") / 1997
)

# or

.filter(
    field("published_year").div(1997)
)

abs

The abs operator returns the absolute value of a number, which is useful for calculating distances or differences.
from topk_sdk.query import abs

# Find books published closest to 1990
select(
    delta=abs(field("published_year").sub(1990))
)

min

The min operator returns the smaller of two values, commonly used for clamping or setting upper bounds. It can work with both scalar values and other fields or expressions.
from topk_sdk.query import min

# Clamp BM25 scores to a maximum of 2.0
select(
    clamped_score=min(field("bm25_score"), 2.0)
)

# Take the lower of critic score vs user rating
select(
    conservative_score=min(field("critic_score"), field("user_rating"))
)

max

The max operator returns the larger of two values, commonly used for clamping or setting lower bounds. It can work with both scalar values and other fields or expressions.
from topk_sdk.query import max

# Ensure minimum relevance score of 1.5
select(
    boosted_score=max(field("relevance_score"), 1.5)
)

# Take the higher of critic score vs user rating
select(
    best_score=max(field("critic_score"), field("user_rating"))
)

ln

The ln operator calculates the natural logarithm, useful for logarithmic scaling and dampening large values.
# Apply logarithmic dampening to scores
select(
    log_score=(field("raw_score") + 1).ln()
)

exp

The exp operator calculates the exponential function (e^x), useful for exponential scaling and boosting.
# Apply exponential boosting to BM25 scores
select(
    boosted_score=(field("bm25_score") * 1.5).exp()
)

sqrt

The sqrt operator calculates the square root, useful for dampening values and creating non-linear transformations.
# Dampen large distance values
select(
    dampened_distance=field("vector_distance").sqrt()
)

square

The square operator multiplies a number by itself (x²), useful for amplifying differences and creating quadratic transformations.
# Create quadratic penalty for age differences
select(
    age_penalty=(field("user_age") - 50).square()
)

Collection

All queries must have a collection stage. Currently, we only support topk() and count() collectors.

topk

Use the topk() function to return the top k results. The topk() function accepts the following parameters:
  • expr (LogicalExpression, required) - The logical expression to sort the results by.
  • k (number, required) - The number of results to return.
  • asc (boolean, required) - Whether to sort the results in ascending order.
To get the top 10 results with the highest title_similarity, you can use the following query:
.topk(field("title_similarity"), 10, asc=False)

count

Use the count() function to get the total number of documents matching the query. If there are no filters then count() will return the total number of documents in the collection.
# Count the total number of documents in the collection
.count()
When writing queries, remember that every query must end with either the topk() or the count() collection stage.

Rerank

The rerank() function is used to rerank the results of a query. Read more about it in our reranking guide.
.rerank()

LSN-based Consistency

TopK supports LSN (Log Sequence Number) based consistency for ensuring read-after-write consistency. When you perform a write operation (like upsert), you receive an LSN as a string that represents the sequence number of that write in the system’s log. You can use this LSN in subsequent queries to ensure that the query only returns results that are at least as recent as that write operation.

How it works

  1. Write operation: When you call lsn = client.collection().upsert(), you receive an LSN
  2. Query with LSN: Pass that LSN to client.collection().query(..., lsn=lsn)
  3. Consistency guarantee: If the write is not yet available in the read path, the query will be rejected and the client will automatically retry
This approach ensures that your queries always see the results of your recent writes, providing strong consistency guarantees when needed.
# Upsert a document and get the LSN
lsn = client.collection("books").upsert([
    {"_id": "1984", "title": "1984", "author": "George Orwell", "year": 1949}
])

# Query with LSN to ensure consistency
results = client.collection("books").query(
    select("title", "author", "year")
    .filter(field("author") == "George Orwell")
    .topk(field("year"), 10),
    lsn=lsn
)
Using LSN-based consistency may increase query latency as the system needs to verify that the specified LSN has been processed before returning results.
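Conceptually, the retry behavior in step 3 can be sketched like this (the real logic lives inside the TopK client; the function names and integer LSNs here are illustrative simplifications):

```python
import time

# Conceptual sketch of read-after-write retry: keep checking whether the
# read path has caught up to the write's LSN before running the query.
def query_at_lsn(current_read_lsn, run_query, lsn, retries=5, delay=0.1):
    for _ in range(retries):
        if current_read_lsn() >= lsn:  # read path has caught up to the write
            return run_query()
        time.sleep(delay)              # back off and try again
    raise TimeoutError(f"read path did not reach LSN {lsn}")
```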