Text fields¶

Unlike other Lucene-based search engines, Nixiesearch distinguishes between singular and repeated fields at the schema level, so choose your field type carefully:

for singular fields, use the text type.
for repeated fields, choose the text[] type.

Example schema for the text fields title and genre:

schema:
  movies:
    fields:
      title:
        type: text      # only a single title is allowed
        search:
          semantic:
            model: e5-small
      genre:
        type: text[]    # there can be multiple genres per document
        search: 
          lexical:
            analyze: english
        filter: true    # field is filterable
        facet: true     # field is facetable
        store: true     # can retrieve the field back from index
        suggest: true   # build autocomplete suggestions based on that field
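
To make the singular/repeated distinction concrete, here is a minimal Python sketch of documents that would fit the schema above. It assumes that a text[] field is sent as a JSON array of strings; the movie values and the check_movie helper are hypothetical and only mirror the schema on the client side, they are not part of Nixiesearch.

```python
import json

# Hypothetical documents for the `movies` index above (values are made up).
single_genre = {"title": "The Dark Knight", "genre": ["action"]}
multi_genre = {"title": "Batman Begins", "genre": ["action", "crime", "drama"]}


def check_movie(doc: dict) -> None:
    """Illustrative client-side check mirroring the schema; not a Nixiesearch API."""
    if not isinstance(doc["title"], str):
        raise TypeError("title is declared as text: expected a single string")
    genre = doc["genre"]
    # Assumption: a text[] field is represented as a JSON array of strings.
    if not (isinstance(genre, list) and all(isinstance(g, str) for g in genre)):
        raise TypeError("genre is declared as text[]: expected a list of strings")


for doc in (single_genre, multi_genre):
    check_movie(doc)
    print(json.dumps(doc))
```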

Semantic index parameters¶

When a text field has semantic search enabled, you can configure several additional parameters:

schema:
  movies:
    fields:
      title:
        type: text      # only a single title is allowed
        search:
          semantic:
            model: e5-small  # server-side inference; set either model or dim, not both
            # dim: 384       # pre-embedded documents; use instead of model
            ef: 32
            m: 16
            quantize: float32
            workers: 4
            distance: dot

Fields:

model (optional): embedding model for server-side inference. Required when documents don't have pre-computed embeddings.
dim (optional): embedding vector dimensions for pre-embedded documents. Required when model is not specified.
ef and m: HNSW index parameters. The higher these values, the better the search recall at the cost of performance.
quantize (optional, float32/int8/int4/int1, default float32): index quantization level. int8 uses 4x less RAM and disk than float32, at the cost of lower recall.
workers (optional, int, default is same as number of CPUs in the system): how many background workers to use for HNSW indexing operations.
distance (optional, dot/cosine, default dot): which embedding distance function to use. dot is faster and mathematically equivalent to cosine when your embeddings are normalized (see the embedding inference section for details).
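
The quantize and distance trade-offs can be checked with simple arithmetic. The Python sketch below estimates raw storage per embedding for each quantization level and verifies that dot product and cosine similarity agree on normalized vectors. It assumes each level stores the named number of bits per dimension and ignores HNSW graph overhead; all function names are illustrative.

```python
import math

# Bits per dimension for each `quantize` level named above (assumed from the level names).
BITS_PER_DIM = {"float32": 32, "int8": 8, "int4": 4, "int1": 1}


def vector_bytes(dim: int, quantize: str = "float32") -> float:
    """Raw storage for one embedding, ignoring HNSW graph overhead."""
    return dim * BITS_PER_DIM[quantize] / 8


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


for q in BITS_PER_DIM:
    print(q, vector_bytes(384, q), "bytes per 384-dim vector")

a, b = normalize([1.0, 2.0, 3.0]), normalize([3.0, 1.0, 0.5])
# On unit-length vectors both functions agree up to floating-point error.
assert math.isclose(dot(a, b), cosine(a, b), rel_tol=1e-9)
```

For a 384-dimensional embedding this gives 1536 bytes at float32 and 384 bytes at int8, which is the 4x saving mentioned above.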

Server-side vs Pre-embedded documents¶

Nixiesearch supports two modes for semantic search:

Server-side inference: Use the model parameter to compute embeddings on the server
Pre-embedded documents: Use the dim parameter when documents already contain embedding vectors

You must specify either model or dim, but not both.
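
The "either model or dim, but not both" rule is an exclusive-or over the two keys. A minimal sketch of that check in Python, using a hypothetical semantic_mode helper that takes the semantic block as a dict (this is an illustration of the rule, not how Nixiesearch validates its config):

```python
def semantic_mode(semantic: dict) -> str:
    """Hypothetical helper: classify a `semantic` block by the model/dim rule."""
    has_model = "model" in semantic
    has_dim = "dim" in semantic
    # Exactly one of the two keys must be present.
    if has_model == has_dim:
        raise ValueError("specify exactly one of 'model' or 'dim'")
    return "server-side" if has_model else "pre-embedded"


print(semantic_mode({"model": "e5-small"}))
print(semantic_mode({"dim": 384}))
```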

Operations on text fields¶

Document ingestion format¶

When a document with a text field is ingested, Nixiesearch expects the JSON payload for the field to be in one of two formats:

JSON string: like {"title":"cookies"}, when the text embedding is computed by the server.
JSON object: like {"title": {"text":"cookies", "embedding": [1,2,3]}} for pre-embedded documents.

See pre-embedded text fields in the Document format section for more details.
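
As an illustration, the two payload shapes can be built in Python as follows. The helper names are hypothetical, and the length check against dim is a client-side sanity check that assumes the embedding length should match the dim configured in the schema.

```python
import json


def server_side_doc(title: str) -> dict:
    """Plain string form: the server computes the embedding."""
    return {"title": title}


def pre_embedded_doc(title: str, embedding: list, dim: int) -> dict:
    """Object form: the text travels together with its precomputed embedding."""
    # Assumption: the vector length should equal the schema's `dim`.
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    return {"title": {"text": title, "embedding": embedding}}


print(json.dumps(server_side_doc("cookies")))
print(json.dumps(pre_embedded_doc("cookies", [1, 2, 3], dim=3)))
```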

Search¶

Text fields exist primarily to be searched. Nixiesearch has two types of indexes that can be used for search, lexical and semantic:

lexical: traditional BM25 keyword search, as in Elastic/SOLR before 2022. Nowadays also called sparse retrieval.
semantic: approximate kNN (a-kNN) vector search over document embeddings, also known as dense retrieval.

By default, text fields are not searchable: you need to explicitly enable lexical retrieval, semantic retrieval, or both at the same time:

schema:
  movies:
    fields:
      title:
        type: text
        search:
          semantic: # build an embedding HNSW index 
            model: e5-small
          lexical:  # build a lexical BM25 index
            analyze: english

After that you can search over text fields with any Query DSL operator Nixiesearch supports, for example match, semantic and rrf:

curl -XPOST http://localhost:8080/v1/index/movies/search \
  -H "Content-Type: application/json" \
  -d '{ 
    "query": {
      "rrf": {
        "queries": [
          {"match": {"title": "batman"}},
          {"semantic": {"title": "batman nolan"}}
        ],
        "rank_window_size": 20
      } 
    }, 
    "fields": ["title"], 
    "size": 5
  }'

See the facets, filters and sorting sections for more details.
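
To give an intuition for what the rrf query in the example above does with its match and semantic sub-queries, here is a minimal Python sketch of textbook reciprocal rank fusion. It assumes the commonly used constant k=60 and treats rank_window_size as the number of top hits taken from each sub-query; the exact constants and semantics Nixiesearch uses are not stated here, and the document IDs are made up.

```python
def rrf(rankings, k=60, rank_window_size=20):
    """Fuse ranked ID lists: each hit adds 1 / (k + rank) to its document's score."""
    scores = {}
    for ranking in rankings:
        # Only the top `rank_window_size` hits of each sub-query take part.
        for rank, doc_id in enumerate(ranking[:rank_window_size], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


lexical_hits = ["m1", "m2", "m3"]    # hypothetical `match` results
semantic_hits = ["m3", "m1", "m4"]   # hypothetical `semantic` results
print(rrf([lexical_hits, semantic_hits]))
```

Documents found by both sub-queries accumulate two contributions, so they tend to rank above documents found by only one.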

Suggestions¶

Text fields can also be used for creating autocomplete suggestions:

curl -XPOST -d '{"query": "h", "fields":["title"]}' http://localhost:8080/v1/index/<index-name>/suggest

The request above returns a response like the following:

{
  "suggestions": [
    {"text": "hugo", "score": 2.0},
    {"text": "hugo boss", "score": 1.0},
    {"text": "hugo boss red", "score": 1.0}
  ],
  "took": 11
}

See the Autocomplete suggestions section for more details.

For further reading, check out how to define numeric fields in the index mapping.