Unified retrieval lets you combine multiple retrieval techniques, such as multi-vector search, hybrid (vector + text) search, and custom scoring functions, within a single query. This flexibility enables more precise and effective ranking of search results.

With TopK, you can:

  1. Retrieve documents based on multiple embeddings — multi-vector retrieval.
  2. Combine semantic similarity (or vector distance) with keyword search — hybrid retrieval.
  3. Apply custom scoring functions that blend multiple ranking factors — custom scoring.

Multi-Vector Retrieval

In some cases, a single vector representation of a document isn’t enough. For example, in a research paper database, you might want to:

  • Retrieve documents based on both a summary of the entire paper and a summary of individual paragraphs.
  • Rank results by combining scores from multiple embeddings.

Defining a Collection with Multiple Embeddings

from topk_sdk.schema import text, semantic_index

# `client` is assumed to be an already-initialized topk_sdk Client
client.collections().create(
    "papers",
    schema={
        "title": text().required(),
        "paper_summary": text().index(semantic_index()),  # Embedding of full paper
        "paragraph_summary": text().index(semantic_index()),  # Embedding of a paragraph
    },
)
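
Once the collection exists, each paper needs values for both summary fields. Below is a minimal sketch of inserting a document, assuming the standard upsert() call that takes a list of documents, each with an _id; the IDs and field values are illustrative only.

client.collection("papers").upsert(
    [
        {
            "_id": "paper-001",  # illustrative document ID
            "title": "Adaptive Learning Rates for Deep Networks",
            "paper_summary": "A survey of optimization schedules used to train deep networks ...",
            "paragraph_summary": "This paragraph compares stochastic gradient descent variants ...",
        }
    ]
)

Both summary fields are embedded automatically because they are indexed with semantic_index().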

Querying with Multiple Embeddings

from topk_sdk.query import field, fn, select

docs = client.collection("papers").query(
    select(
        "title",
        paper_score=fn.semantic_similarity("paper_summary", "deep learning optimization"),
        paragraph_score=fn.semantic_similarity("paragraph_summary", "stochastic gradient descent")
    )
    .top_k(field("paper_score") * 0.7 + field("paragraph_score") * 0.3, 10)
)

Explanation

  • We retrieve documents based on both the full paper summary and the paragraph summary.
  • The top_k() function blends the two scores, giving 70% weight to the full-paper score and 30% weight to the paragraph score.
  • This ranks papers that are relevant overall higher while still rewarding strong paragraph-level matches; a sketch of making the weights configurable follows below.
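
The 70/30 split is just one choice. Because the blend is an ordinary arithmetic expression over the selected scores, you can factor it into a small helper and tune the weights per use case. The sketch below uses only the query primitives shown above; the weights, query strings, and helper name are illustrative.

from topk_sdk.query import select, field, fn

def multi_vector_query(paper_query: str, paragraph_query: str, paper_weight: float = 0.7):
    # Blend two semantic similarity scores with a configurable weight split.
    paragraph_weight = 1.0 - paper_weight
    return (
        select(
            "title",
            paper_score=fn.semantic_similarity("paper_summary", paper_query),
            paragraph_score=fn.semantic_similarity("paragraph_summary", paragraph_query),
        )
        .top_k(
            field("paper_score") * paper_weight + field("paragraph_score") * paragraph_weight,
            10,
        )
    )

docs = client.collection("papers").query(
    multi_vector_query("deep learning optimization", "stochastic gradient descent", paper_weight=0.8)
)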

Hybrid Retrieval

Hybrid retrieval combines semantic similarity (vector-based search) with exact keyword matching. This ensures that documents containing explicit matches to the query keywords are considered alongside semantically similar ones.

Defining a Collection for Hybrid Retrieval

from topk_sdk.schema import text, semantic_index, keyword_index

client.collections().create(
    "articles",
    schema={
        "title": text().required().index(keyword_index()),  # Keyword-based retrieval
        "content": text().index(semantic_index()),  # Semantic search
    },
)

Querying with Hybrid Retrieval

from topk_sdk.query import field, fn, match, select

docs = client.collection("articles").query(
    select(
        "title",
        content_similarity=fn.semantic_similarity("content", "climate change policies"),
        text_score=fn.bm25_score()
    )
    .filter(match("carbon tax") | match("renewable energy"))  # Ensure keyword relevance
    .top_k(field("content_similarity") * 0.6 + field("text_score") * 0.4, 10)
)

Explanation

  • We retrieve documents based on semantic meaning (content_similarity) and keyword relevance (text_score, computed with BM25).
  • The filter() ensures that documents must contain at least one relevant keyword.
  • The top_k() function weights the scores, prioritizing semantic meaning (60%) while still considering keyword matches (40%).

This balances precision and recall, capturing both exact keyword matches and meaningful context.
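
The filter is an ordinary boolean expression, so you can tighten or loosen the keyword requirement without touching the scoring. For example, the sketch below requires both keyword phrases instead of at least one, assuming match() expressions combine with & the same way they combine with |.

docs = client.collection("articles").query(
    select(
        "title",
        content_similarity=fn.semantic_similarity("content", "climate change policies"),
        text_score=fn.bm25_score(),
    )
    # Require both keyword phrases to be present, not just one of them
    .filter(match("carbon tax") & match("renewable energy"))
    .top_k(field("content_similarity") * 0.6 + field("text_score") * 0.4, 10)
)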


Custom Scoring Functions

TopK allows you to define powerful scoring functions by combining semantic similarity with additional fields, such as a precomputed importance score.

Defining a Collection with Custom Scoring Fields

from topk_sdk.schema import text, float, semantic_index

client.collections().create(
    "documents",
    schema={
        "content": text().index(semantic_index()),  # Semantic search
        "importance": float().required(),  # Precomputed importance score
    },
)
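
The importance value is something you compute and supply at write time, for example from citation counts, freshness, or editorial curation. A minimal, illustrative upsert is sketched below, with the same assumptions about client and upsert() as in the multi-vector example.

client.collection("documents").upsert(
    [
        {
            "_id": "doc-042",  # illustrative document ID
            "content": "An overview of machine learning applications in healthcare ...",
            "importance": 0.9,  # precomputed importance, e.g. normalized to [0, 1]
        }
    ]
)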

Querying with a Custom Scoring Function

from topk_sdk.query import field, fn, select

docs = client.collection("documents").query(
    select(
        content_score=fn.semantic_similarity("content", "machine learning applications"),
    )
    .top_k(field("importance") * field("content_score"), 10)
)

Explanation

  • We retrieve documents based on both semantic similarity (content_score) and the precomputed importance field.
  • The top_k() function multiplies the two factors, so a document ranks highly only when it is both semantically relevant and marked as important.
  • This lets you boost critical documents without letting highly similar but low-importance content dominate the results; an additive variant with explicit weights is sketched below.
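
If you prefer an additive blend, for example 80% relevance and 20% importance, the same query can be written as a weighted sum using the constructs shown above. The weights are illustrative, and for the sum to be meaningful the importance field should be on a scale comparable to the similarity score (e.g. normalized to [0, 1]).

from topk_sdk.query import select, field, fn

docs = client.collection("documents").query(
    select(
        "importance",
        content_score=fn.semantic_similarity("content", "machine learning applications"),
    )
    # Weighted sum: 80% semantic relevance, 20% precomputed importance (illustrative weights)
    .top_k(field("content_score") * 0.8 + field("importance") * 0.2, 10)
)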