TopK allows you to query documents using a SQL-inspired syntax. It comes with semantic search, text search, vector search, and filtering capabilities out of the box.

TopK’s declarative query builder allows you to select fields, chain filters, apply vector/text search in a composable way.

To perform a semantic search on your documents, use the semantic_similarity function. Query you want to search for is automatically embedded using the model specified in the semantic_index definition.

from topk.query import select, fn, match

docs = client.collection("books").query(
    select(
        "title",
        title_similarity=fn.semantic_similarity("title", "catcher")
    )
    .top_k(field("title_similarity"), 10)
)

To perform a text search on your documents, use the match function. The match function will by default execute against all fields with a keyword_index defined on them. You can provide field parameter to specify a field to match against.

To use the match predicate, the field must have a keyword_index defined on it.

from topk.query import select, fn, match

docs = client.collection("books").query(
    select(
        "title",
        # Score documents using BM25 algorithm
        text_score=fn.bm25_score()
    )
    # Filter to documents that have the `great` keyword in the `title` field
    # or the `catcher` in any of the text-indexed fields.
    .filter(match("great", field="title") | match("catcher"))
    # Return top 10 documents with the highest text score
    .top_k(field("text_score"), 10)
)

The above example takes a collection of books and filters to documents that have the Great keyword in the title field.

To perform a vector search on your documents, use the vector_distance method. This computes the distance between the provided query vector and vectors stored inside the database according to the metric specified in the vector_index definition.

from topk.query import select, fn

docs = client.collection("books").query(
    select(
        "title",
        # Select supports arbitrary expressions. For example, you can add 10 years to the `published_year` field.
        published_year_plus_10=field("published_year") + 10,
        # Compute vector distance between the provided query vector and the `title_embedding` field.
        title_similarity=fn.vector_distance("title_embedding", [0.1, 0.2, 0.3, ...])
    )
    # Return top 10 results with the highest similarity
    .top_k(field("title_similarity"), 10)
)

The above example computes the cosine similarity between the vector and the title_embedding field. The title_embedding field is a vector field that represents the title of the book.

Filtering

You can filter documents based on metadata, keywords, values computed inside select() (e.g. vector similarity), and more. Filter expressions support all comparison operators: ==, !=, >, >=, <, <=, arithmetic operations: +, -, *, /, and boolean operators: | and &.

Metadata filtering

The example below shows filtering by metadata.

.filter(
    field("published_year") > 1980
)

Keyword filtering

The example below shows filtering documents by the keywords they contain. It’ll match all documents that contain either gatsby or both catcher and rye.

.filter(
    match("gatsby") | (match("catcher") & match("rye"))
)

Filter expressions

You can combine multiple filter expressions using boolean operators. For example, the query below will match all documents that were published in 1997 or between 1920 and 1980.

.filter(
    (field("published_year") == 1997) | ((field("published_year") >= 1920) & (field("published_year") <= 1980))
)

starts_with

The starts_with operator can be used on string fields to match documents that start with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.

.filter(
    field("_id").starts_with("tenant_123/")
)

Results collection

All queries are required to have a collection stage. Currently, we only support top_k and count collectors.

top_k

You can use the top_k method to return the top k results. The top_k method takes a field, the number of results to return, and an optional asc parameter to sort the results in ascending order.

# Return top 10 results order by `published_year` ascending
.top_k(field("title_similarity"), 10, asc=True)

count

You can use the count method to get the total number of documents matching the query. If there are no filters then count will return the total number of documents in the collection.

# Count the total number of documents in the collection
.count()

Reranking

You can rerank the results of a query using the rerank method. The rerank method takes:

  • model (optional): The model to use for reranking. If not specified, uses the default model.
  • query (optional): The query text to rerank against. Uses arguments from semantic_similarity if not specified.
  • fields (optional): List of fields to use for reranking. Uses fields from semantic_similarity if not specified.
  • topk_multiple (optional): Multiple of top-k to rerank. For example, if top_k=10 and topk_multiple=2, reranker takes 20 results from the original query and returns the top 10 results.

Supported models:

  • cohere/rerank-v3.5
.rerank()

# or 

.rerank(
    model="cohere/rerank-v3.5",
    query="catcher",
    fields=["title", "description"],
    topk_multiple=2
)

Putting it all together

You can combine text search, vector search, and filtering in a single query. Some people refer to this as a hybrid search and we believe it can provide you with the best and most relevant search results.

from topk.query import select, fn, match

docs = client.collection("books").query(
    select(
        "title",
        # Score documents using BM25 algorithm
        text_score=fn.bm25_score()
        # Compute semantic similarity between the provided query and the `title` field.
        title_similarity=fn.semantic_similarity("title", "catcher")
    )
    # Filter to documents that contain the `great` keyword
    .filter(match("great"))
    # Filtering by metadata
    .filter(field("published_year") > 1980)
    # Return top 10 documents with the highest combined score
    .top_k(field("text_score") * 0.2 + field("title_similarity") * 0.8, 10)
    # Rerank the documents
    .rerank()
)

Notes on writing queries

When writing queries, remember that they all require the top_k or count method at the end.

Next steps

Need support or want to give some feedback? You can join our community or drop us an email at support@topk.io.