Creating a collection

The topk_sdk.schema module contains the schema definition for the fields in a collection.

from topk_sdk.schema import int, text, vector, vector_index, keyword_index, semantic_index

To create a collection, you need to pass the collection name and the schema to the create method.

client.collections().create(
    "books",
    schema={
        "title": text().required().index(keyword_index()),
        "title_embedding": vector(1536)
            .required()
            .index(vector_index(metric="euclidean")),
        "published_year": int().required(),
    },
)

Schema

Data types

TopK supports the following data types. There are more on our roadmap and we are working effortlessly to bring them to you as soon as possible.

int()

int() is used to define an integer field in the schema.

"published_year": int()

float()

float() is used to define a float field in the schema.

"price": float()

bool()

bool() is used to define a boolean field in the schema.

"is_published": bool()

text()

text() is used to define a text field in the schema.

"title": text()

f32_vector()

f32_vector() is used to define a vector field with f32 values. You can pass vector dimension as a parameter (required, greater than 0) which will be validated when upserting documents.

"title_embedding": f32_vector(dimension=1536)

u8_vector()

u8_vector() is used to define a vector field with u8 values. You can pass vector dimension as a parameter (required, greater than 0) which will be validated when upserting documents.

"title_embedding": u8_vector(dimension=1536)

binary_vector()

binary_vector() is used to define a binary vector packed into u8 values. You can pass vector dimension as a parameter (required, greater than 0) which will be validated when upserting documents.

Binary vector dimension is defined in terms of the number of bytes. This means that for a 1024-bit binary vector, the dimension topk expects is 128 (1024 / 8).

"title_embedding": binary_vector(dimension=128)

bytes()

bytes() is used to define a bytes field in the schema.

"image": bytes()

Properties

required()

required() is used to mark a field as required. All fields are optional by default.

However, don’t forget that every document has to have an _id.
"title": text().required()

The above example shows how to mark a field title as required.

Methods

index()

index() is used to create an index on a field.

Semantic index

The semantic_index() method is used to create both a vector and a keyword index on a given field. This allows you to do both semantic search and keyword search over the same field. Note that semantic_index() can only be called over text() data type.

"title": text().index(semantic_index())

Optionally, you can pass a model parameter to the semantic_index() method. Supported models are:

  • cohere/embed-english-v3
  • cohere/embed-multilingual-v3 (default)

Vector index

The vector_index() method is used to create a vector index. Only fields with f32_vector(), u8_vector(), or binary_vector() data types can be indexed with a vector index.

"title_embedding": vector(dimension=1536).index(vector_index(metric="cosine"))

The above example shows how to create a vector index on a field title_embedding with the cosine similarity metric. However, there are more metrics available:

  • euclidean
  • cosine
  • dot_product
  • hamming (only supported for binary_vector() type)

Need support or want to give some feedback? You can join our community or drop us an email at support@topk.io.

Keyword index

The keyword_index() method is used to create a keyword index on a text() field. index(keyword_index()) can only be called over text() data type.

"title": text().index(keyword_index())