Text Similarity

Compare one set of texts against another by meaning. Every text you send is scored against every reference text, and the API returns the matches ranked by similarity — across languages, regardless of phrasing. Up to 200 texts per request, both sets combined.

How it works

You send two sets of texts in a single request. base_texts is the reference set — the pool you want to match against. compare_to_texts holds the texts you want matches for.

Two sets of texts

Both sets are objects of id → text pairs, so every score maps back to your own identifiers. The reference set can be anything: known churn complaints, last week's tickets, a handful of hand-picked examples.

Every pair is scored

Each compared text is scored against every reference text. Scores run from 0 to 1. They are weighted toward meaning (texts are embedded with multilingual models), with a smaller weight on surface-level text overlap.

Ranked matches per text

The response lists, for every compared text, the reference texts ranked by similarity. Use top_n to cap how many matches come back per text.

Cross-language matching

Because similarity is scored on meaning, it works across languages. A text in English will match relevant texts in Swedish, German, Arabic, or any other language in your sets. This is particularly valuable for Nordic companies with multilingual customer bases — compare conversations across markets without translating anything first.

Example: The text "customer was promised a callback but never received one" matches a Swedish conversation containing "Jag blev lovad att någon skulle ringa tillbaka men ingen har hört av sig" — with a high similarity score — because the meaning is the same.

Request body

Parameter	Type	Description
`base_texts`	object	Required. The reference set, as id → text pairs. Every compared text is scored against every text in this set.
`compare_to_texts`	object	Required. The texts you want matches for, as id → text pairs. The response is keyed by these ids.
`top_n`	integer	Optional. Number of matches to return per compared text. Defaults to all reference texts, ranked by similarity.

A single request can hold up to 200 texts — base_texts and compare_to_texts combined. For larger jobs, batch your reference pool across multiple requests.
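A minimal Python sketch of that batching, assuming the score for a pair does not depend on which other texts share the request, so rankings from separate requests can be merged. The helper names batch_payloads and merge_responses are hypothetical and not part of the API.

```python
from itertools import islice

MAX_TEXTS_PER_REQUEST = 200  # documented limit: both sets combined


def batch_payloads(base_texts, compare_to_texts, top_n=None):
    """Yield request bodies that each stay within the combined text limit."""
    room = MAX_TEXTS_PER_REQUEST - len(compare_to_texts)
    if room < 1:
        raise ValueError("compare_to_texts leaves no room for reference texts")
    items = iter(base_texts.items())
    while True:
        # Take the next slice of the reference pool that still fits.
        chunk = dict(islice(items, room))
        if not chunk:
            break
        payload = {"base_texts": chunk, "compare_to_texts": compare_to_texts}
        if top_n is not None:
            payload["top_n"] = top_n
        yield payload


def merge_responses(responses, top_n=None):
    """Combine per-batch responses into one ranking per compared id."""
    combined = {}
    for response in responses:
        for compared_id, matches in response.items():
            combined.setdefault(compared_id, []).extend(matches)
    merged = {}
    for compared_id, matches in combined.items():
        ranked = sorted(matches, key=lambda m: m["similarity"], reverse=True)
        merged[compared_id] = ranked if top_n is None else ranked[:top_n]
    return merged
```

Sending the same top_n with every batch and applying it again after merging keeps the final list correct, because the overall top matches are always among the top matches of their own batch.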

Request example

Find the conversations most similar to a known churn complaint. The reference pool holds five conversations in Swedish, English, and German; the compared text is in English.

POST /v2/similarity
Authorization: Bearer your-api-key
Content-Type: application/json

{
  "base_texts": {
    "conv_4821": "Jag har varit kund i tio år men nu får det vara nog. Priset har höjts tre gånger på två år.",
    "conv_4822": "Can you help me upgrade my plan to the 100 Mbit package?",
    "conv_4823": "I want to cancel my subscription. I've found a better deal elsewhere and your retention offer wasn't enough.",
    "conv_4824": "The internet has been dropping every evening for two weeks.",
    "conv_4825": "Ich bin seit fünf Jahren Kunde und die Preise steigen ständig. Ich denke über einen Wechsel nach."
  },
  "compare_to_texts": {
    "churn_example": "I've been a customer for 8 years but I'm seriously considering switching to another provider. The price keeps going up and support takes forever."
  },
  "top_n": 3
}
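
The same request could be built in Python roughly as follows. This is a sketch using only the standard library; the host in BASE_URL is a placeholder, since this page gives only the path, and the function names are hypothetical.

```python
import json
import urllib.request

# Placeholder host (assumption): the docs only state the path /v2/similarity.
BASE_URL = "https://api.example.com"


def build_similarity_request(api_key, base_texts, compare_to_texts, top_n=None):
    """Build the POST request without sending it."""
    body = {"base_texts": base_texts, "compare_to_texts": compare_to_texts}
    if top_n is not None:
        body["top_n"] = top_n
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/similarity",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.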

Response

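The response is keyed by your compare_to_texts ids. Each id holds the reference texts ranked by similarity score (0 to 1), capped at top_n. Notice that the Swedish and German churn conversations rank right behind the English cancellation, even though they are in different languages from the compared text, while the upgrade request and the outage report do not make the top three.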

{
  "churn_example": [
    { "id": "conv_4823", "similarity": 0.91 },
    { "id": "conv_4821", "similarity": 0.87 },
    { "id": "conv_4825", "similarity": 0.84 }
  ]
}
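
Once decoded, the response is a plain mapping, so filtering it takes a few lines. The sketch below keeps only matches at or above a cutoff; the 0.85 threshold is an arbitrary value chosen for illustration, not a recommendation from the API, and matches_above is a hypothetical helper.

```python
def matches_above(response, threshold):
    """Keep, per compared id, only the matches scoring at or above threshold."""
    return {
        compared_id: [m for m in matches if m["similarity"] >= threshold]
        for compared_id, matches in response.items()
    }


# The response shown above, as a Python dict.
response = {
    "churn_example": [
        {"id": "conv_4823", "similarity": 0.91},
        {"id": "conv_4821", "similarity": 0.87},
        {"id": "conv_4825", "similarity": 0.84},
    ]
}

strong = matches_above(response, 0.85)
strong_ids = [m["id"] for m in strong["churn_example"]]
# strong_ids == ["conv_4823", "conv_4821"]
```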

Use cases

Find churn patterns

Score batches of recent conversations against a known churn complaint. The ranked matches show how widespread a specific complaint is — across languages, channels, and time periods — before it shows up in your churn metrics.

Detect duplicate tickets

Compare incoming tickets against the ones already open. A Swedish complaint and an English complaint about the same issue match by meaning — de-duplicate across languages and surface recurring issues that span multiple markets.

Match against known patterns

Describe a pattern in plain language: "customer was promised a callback but never received one". Score conversations against that description and the matches surface, regardless of how the customer or agent phrased it.

Build training sets for new models

Need examples of "broken promise" complaints to train a new classifier? Score a batch of conversations against a handful of known examples, review the top matches, and you have a curated training set in minutes instead of days.
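
One way to turn such a response into a review queue is sketched below. It assumes the conversations were sent as compare_to_texts and the known examples as base_texts, so the response is keyed by conversation id. The function name, the sample data, and the 0.6 cutoff are all made up for illustration.

```python
def rank_candidates(response, min_score=0.0):
    """Rank compared ids by their best score against any reference text."""
    candidates = []
    for compared_id, matches in response.items():
        if not matches:
            continue
        best = max(matches, key=lambda m: m["similarity"])
        if best["similarity"] >= min_score:
            candidates.append((compared_id, best["id"], best["similarity"]))
    # Highest-scoring conversations first, ready for manual review.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates


# Illustrative response only; ids and scores are invented.
sample = {
    "conv_1": [{"id": "ex_a", "similarity": 0.42}, {"id": "ex_b", "similarity": 0.31}],
    "conv_2": [{"id": "ex_b", "similarity": 0.88}, {"id": "ex_a", "similarity": 0.79}],
    "conv_3": [{"id": "ex_a", "similarity": 0.67}],
}

review_queue = rank_candidates(sample, min_score=0.6)
# review_queue == [("conv_2", "ex_b", 0.88), ("conv_3", "ex_a", 0.67)]
```

Keeping the id of the best-matching example next to each candidate shows the reviewer which known example triggered the match.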
