Collections organize your documents, define their schema, and enable fast vector search, filtering, keyword search, semantic search, and multi-vector search.
Creating a collection
To create a collection, call the create() method on the client.collections() object:
from topk_sdk.schema import int, text, semantic_index

# Assumes `client` is an already-initialized TopK client.
client.collections().create(
    "books",
    schema={
        "title": text().required().index(semantic_index()),
        "published_year": int().required(),
    },
)
Field names starting with _ are reserved for internal use.
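As a minimal sketch of the naming rule above, a schema can be checked on the client side before it is sent. The helper name `reserved_field_names` and the placeholder schema values are illustrative assumptions and are not part of the TopK SDK.

```python
def reserved_field_names(schema: dict) -> list[str]:
    """Return the schema field names that start with an underscore."""
    return [name for name in schema if name.startswith("_")]


# Placeholder strings stand in for real field definitions such as text().
candidate_schema = {"title": "text", "_score": "float", "published_year": "int"}

print(reserved_field_names(candidate_schema))  # ['_score']
```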
Schema
Opt-in schema
TopK is schemaless by default. Fields without a declared type can store any value. When a type is specified, values are validated during upsert.
Indexed fields require explicit types.
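The opt-in behavior can be pictured with a small plain-Python sketch: typed fields are checked, untyped fields pass through untouched. This is an illustration of the idea only, not TopK's validation logic; the `TYPED_FIELDS` mapping and the `type_errors` helper are hypothetical names.

```python
# Hypothetical stand-in for a typed schema: field name -> expected Python type.
# Fields absent from this mapping are untyped and accept any value.
TYPED_FIELDS = {"title": str, "published_year": int}


def type_errors(doc: dict) -> list[str]:
    """Return one message per typed field whose value has the wrong type."""
    errors = []
    for name, expected in TYPED_FIELDS.items():
        if name in doc and not isinstance(doc[name], expected):
            actual = type(doc[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    return errors


ok_doc = {"title": "Dune", "published_year": 1965, "notes": {"any": "value"}}
bad_doc = {"title": "Dune", "published_year": "1965"}

print(type_errors(ok_doc))   # []
print(type_errors(bad_doc))  # ['published_year: expected int, got str']
```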
Field types
| Type | Use case |
|---|---|
| text() | Strings, descriptions, content, IDs |
| bytes() | Binary data, images, files |
| int() | Integers, counts, IDs |
| float() | Decimal numbers, prices |
| bool() | true/false values |
| list(value_type) | Arrays of text, integer, or float elements |
| f8_vector(dim) | 8-bit float embeddings |
| f16_vector(dim) | 16-bit float embeddings |
| f32_vector(dim) | Dense embeddings (most common) |
| u8_vector(dim) | Quantized embeddings |
| i8_vector(dim) | Signed quantized embeddings |
| binary_vector(dim) | Binary embeddings |
| f32_sparse_vector() | Sparse embeddings |
| u8_sparse_vector() | Quantized sparse embeddings |
| matrix(dim, value_type) | Multi-vector embeddings |
Required fields
Fields are optional by default.
Add required() to make a field mandatory. Required fields must be present in every document during upsert; documents missing a required field are rejected with a validation error.
from topk_sdk.schema import int, text

schema = {
    "name": text().required(),  # Must be present in all documents
    "price": int(),  # Can be omitted (null)
}
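To catch missing fields before an upsert is attempted, a client-side pre-check along the following lines may help. It is a sketch under stated assumptions: the `REQUIRED` mapping is a hand-written mirror of the schema above, `missing_required` is a hypothetical helper, and treating an explicit None like an omitted field is an assumption rather than documented behavior.

```python
# Hypothetical mirror of the schema above: True where required() was applied.
REQUIRED = {"name": True, "price": False}


def missing_required(doc: dict) -> list[str]:
    """List required fields that are absent (or None) in the document."""
    # Assumption: an explicit None is treated the same as an omitted field.
    return [
        field
        for field, is_required in REQUIRED.items()
        if is_required and doc.get(field) is None
    ]


print(missing_required({"name": "Widget"}))  # []
print(missing_required({"price": 10}))       # ['name']
```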
Indexes
Only indexed fields can be searched. Non-indexed fields support exact-match filters only.
Vector Index
Used for vector search. Enabled by vector_index(). Supports dimensions up to 2^14 (16,384).
from topk_sdk.schema import f32_vector, vector_index

schema = {
    "embedding": f32_vector(dimension=1536).index(vector_index(metric="cosine")),
}
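A quick local sanity check on embeddings can be sketched as follows. The 2^14 limit comes from the text above; the requirement that a vector's length equals the declared dimension is an assumption made for this example, and `check_vector` is a hypothetical helper, not an SDK function.

```python
MAX_DIMENSION = 2 ** 14  # 16384, the limit stated above


def check_vector(vector: list[float], dimension: int) -> None:
    """Raise ValueError if the vector does not fit the declared dimension."""
    if not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(f"dimension {dimension} is outside 1..{MAX_DIMENSION}")
    # Assumption: stored vectors must have exactly `dimension` values.
    if len(vector) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(vector)}")


check_vector([0.0] * 1536, dimension=1536)  # passes silently
```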
Similarity metrics compatibility:
| Vector Type | cosine | euclidean | dot_product | hamming |
|---|---|---|---|---|
| f8_vector | ✅ | ✅ | ✅ | — |
| f16_vector | ✅ | ✅ | ✅ | — |
| f32_vector | ✅ | ✅ | ✅ | — |
| u8_vector | ✅ | ✅ | ✅ | — |
| i8_vector | ✅ | ✅ | ✅ | — |
| binary_vector | — | — | — | ✅ |
| f32_sparse_vector | — | — | ✅ | — |
| u8_sparse_vector | — | — | ✅ | — |
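For intuition, the four metrics and the compatibility table can be sketched in plain Python. The formulas are the standard textbook definitions; how TopK computes, scales, or orders scores internally is not described here, so treat this as an illustration only. `SUPPORTED_METRICS` is transcribed from the table above and `is_supported` is a hypothetical helper.

```python
import math

# Transcribed from the compatibility table above.
DENSE = {"cosine", "euclidean", "dot_product"}
SUPPORTED_METRICS = {
    "f8_vector": DENSE,
    "f16_vector": DENSE,
    "f32_vector": DENSE,
    "u8_vector": DENSE,
    "i8_vector": DENSE,
    "binary_vector": {"hamming"},
    "f32_sparse_vector": {"dot_product"},
    "u8_sparse_vector": {"dot_product"},
}


def is_supported(vector_type: str, metric: str) -> bool:
    """Check a (vector type, metric) pair against the table."""
    return metric in SUPPORTED_METRICS.get(vector_type, set())


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    # Number of positions where the two bit sequences differ.
    return sum(x != y for x, y in zip(a, b))


print(dot_product([1, 2, 3], [4, 5, 6]))       # 32
print(euclidean([0, 0], [3, 4]))               # 5.0
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))     # 2
print(is_supported("binary_vector", "cosine")) # False
```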
Multi Vector Index
Enables multi-vector search on matrix() fields using the maxsim metric for late-interaction scoring. Enabled by multi_vector_index(). See multi-vector search for more information.
from topk_sdk.schema import matrix, multi_vector_index

schema = {
    "token_embeddings": matrix(
        dimension=1536,
        value_type="f32",
    ).index(
        multi_vector_index(metric="maxsim")
    ),
}
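Late-interaction scoring with MaxSim is commonly defined as follows: for each query vector, take the highest dot product against any document vector, then sum those maxima. The sketch below implements that common definition in plain Python; whether TopK's maxsim metric matches it in every detail (for example, normalization) is an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors, doc_vectors):
    """Sum, over query vectors, of the best dot product with any doc vector."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


# Two query token vectors scored against two document token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5]]

print(maxsim(query, doc))  # 1.5  (best matches: 1.0 and 0.5)
```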
Keyword Index
Traditional text search with BM25 relevance scoring. Fast keyword matching with no embedding overhead. Enabled by keyword_index().
from topk_sdk.schema import keyword_index, text

schema = {
    "title": text().index(keyword_index()),
}
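To show what BM25 relevance scoring does, here is a compact plain-Python sketch of the Okapi BM25 formula. The tokenizer (lowercase, split on whitespace), the parameters k1=1.2 and b=0.75, and the IDF variant are common defaults chosen for illustration; they are assumptions and not TopK's documented settings.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


titles = ["The Great Gatsby", "Great Expectations", "Moby Dick"]
scores = bm25_scores("great gatsby", titles)
# The first title matches both query terms, the second matches one,
# and the third matches none, so it scores 0.0.
print(scores)
```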
Semantic Index
A convenience index that generates embeddings for the field automatically. Enabled by semantic_index().
from topk_sdk.schema import semantic_index, text

schema = {
    "title": text().index(semantic_index()),
}
See semantic_index() for model details.