Multi-Model Inference
POST /v2/models/inference

We expose additional endpoints on request — model training, transcription, labeling, and more.
Run multiple models against the same texts in a single request. Every model executes in parallel — total latency is roughly the same as a single model call. This is the backbone of hierarchical classification.
Real-world scenario
A customer calls about their mobile broadband bill. From a single transcript, you need to know: which product? what issue? what root cause? what sentiment? With multi-model inference, you run 4 models in one request and get all four answers back in the time it takes to run one.
| Dimension | Prediction |
|---|---|
| Product | Mobile Broadband |
| Issue type | Billing |
| Root cause | Incorrect charge |
| Sentiment | Frustrated |
Hierarchical classification
Most real-world use cases require multiple classification dimensions. A single customer interaction carries signal about the product, the issue, the root cause, and the customer's emotional state. Instead of chaining sequential API calls, multi-model inference lets you define all models upfront and get every dimension classified in a single round-trip.
Each model in the model_settings array is independent — it sees the same input texts but applies its own labels and confidence thresholds. You control how many predictions each model returns and which labels to filter for.
Request body
| Parameter | Type | Description |
|---|---|---|
| texts | string[] | Required. Texts to classify. Max 8 per request. |
| model_settings | object[] | Required. Array of model configurations. Each object accepts a model_id plus the optional per-model settings described below. |
Per-model settings
Each model in the array can be configured independently. This is important because different classification tasks have different requirements:
max_predictions
Controls how many labels are returned per text. Set to 1 for single-label tasks like sentiment, or 3 for multi-label tasks where a conversation might touch several topics. Omit to get all labels with their scores.
label_filter
Restricts the response to specific labels. A sentiment model might have 5 labels (Very positive, Positive, Neutral, Negative, Very negative), but your workflow only needs Positive vs. Negative. Filtering reduces payload size and simplifies downstream logic.
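The combined effect of these two settings can be sketched client-side. This is an illustration only — `select_predictions` and the example scores are made up for this sketch, not the server's actual implementation:

```python
def select_predictions(scores, max_predictions=None, label_filter=None):
    """Apply label_filter, then sort by confidence descending,
    then truncate to max_predictions. scores: dict label -> confidence."""
    preds = [{"label": label, "score": score} for label, score in scores.items()]
    if label_filter is not None:
        preds = [p for p in preds if p["label"] in label_filter]
    preds.sort(key=lambda p: p["score"], reverse=True)
    if max_predictions is not None:
        preds = preds[:max_predictions]
    return preds

# Hypothetical raw scores from a 5-label sentiment model.
sentiment_scores = {
    "Very positive": 0.02, "Positive": 0.05, "Neutral": 0.11,
    "Negative": 0.60, "Very negative": 0.22,
}

# Workflow only needs Positive vs. Negative:
# returns Negative (0.60) first, then Positive (0.05).
print(select_predictions(sentiment_scores, label_filter=["Positive", "Negative"]))
```

With `max_predictions=1` and no filter, the same function would return only the top-scoring label, matching the single-label sentiment setup shown in the request example below.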
Request example
Classify a single transcript across four models — product, issue type, root cause, and sentiment — in one call.
POST /v2/models/inference
Authorization: Bearer your-api-key
Content-Type: application/json
{
"texts": [
"Hi, I'm calling about my mobile broadband. I was charged 349 kr this month but my plan is supposed to be 199 kr. This has happened two months in a row and I'm really frustrated. I've been a customer for six years and I'm starting to think about switching."
],
"model_settings": [
{
"model_id": 42,
"max_predictions": 1
},
{
"model_id": 87,
"max_predictions": 2
},
{
"model_id": 156,
"max_predictions": 3
},
{
"model_id": 201,
"label_filter": ["Frustrated", "Angry", "Neutral", "Satisfied"]
}
]
}

The model IDs used in this request:

| Model ID | Name | Purpose | Labels (subset) |
|---|---|---|---|
| 42 | Product | Which product is the customer calling about? | Mobile Broadband, Fixed Line, TV, ... |
| 87 | Issue type | What category of issue is raised? | Billing, Technical, Cancellation, Upgrade, ... |
| 156 | Root cause | What caused the issue? | Incorrect charge, System error, Policy change, ... |
| 201 | Sentiment | How does the customer feel? | Frustrated, Angry, Neutral, Satisfied |
Response
The response groups results by model. Each model returns an array of predictions per input text, sorted by confidence descending.
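A minimal sketch of consuming this shape client-side — the `response` dict here is a trimmed stand-in for the full example that follows, and `top_labels` is an illustrative helper, not part of any official client:

```python
# Trimmed stand-in for a real response body.
response = {
    "results": [
        {"model_id": 42, "model_name": "Product",
         "predictions": [[{"label": "Mobile Broadband", "score": 0.94}]]},
        {"model_id": 201, "model_name": "Sentiment",
         "predictions": [[{"label": "Frustrated", "score": 0.82},
                          {"label": "Angry", "score": 0.14}]]},
    ]
}

def top_labels(response, text_index=0):
    """Map model_name -> highest-confidence label for one input text.
    Relies on predictions being sorted by confidence descending."""
    return {
        r["model_name"]: r["predictions"][text_index][0]["label"]
        for r in response["results"]
        if r["predictions"][text_index]  # skip models with no predictions
    }

print(top_labels(response))  # {'Product': 'Mobile Broadband', 'Sentiment': 'Frustrated'}
```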
{
"results": [
{
"model_id": 42,
"model_name": "Product",
"predictions": [
[
{ "label": "Mobile Broadband", "score": 0.94 }
]
]
},
{
"model_id": 87,
"model_name": "Issue Type",
"predictions": [
[
{ "label": "Billing", "score": 0.91 },
{ "label": "Cancellation", "score": 0.38 }
]
]
},
{
"model_id": 156,
"model_name": "Root Cause",
"predictions": [
[
{ "label": "Incorrect charge", "score": 0.88 },
{ "label": "Recurring billing error", "score": 0.72 },
{ "label": "Price plan mismatch", "score": 0.45 }
]
]
},
{
"model_id": 201,
"model_name": "Sentiment",
"predictions": [
[
{ "label": "Frustrated", "score": 0.82 },
{ "label": "Angry", "score": 0.14 },
{ "label": "Neutral", "score": 0.03 },
{ "label": "Satisfied", "score": 0.01 }
]
]
}
]
}

Performance
All models run in parallel
Total latency is approximately equal to the slowest individual model, not the sum of all models. Running 4 models takes roughly the same time as running 1. This makes multi-model inference the recommended approach for any production pipeline that needs multiple classification dimensions.
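The latency math can be illustrated with a local simulation. Note this is a client-side sketch only — the real fan-out happens server-side, and the model IDs and latencies below are invented for the demonstration:

```python
import concurrent.futures
import time

# Invented per-model latencies in seconds; sequential sum is 0.45s.
LATENCIES = {42: 0.05, 87: 0.08, 156: 0.12, 201: 0.20}

def mock_model(model_id):
    """Stand-in for one model's inference call."""
    time.sleep(LATENCIES[model_id])
    return model_id

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(mock_model, LATENCIES))
elapsed = time.perf_counter() - start

# Wall time tracks the slowest model (~0.20s), not the 0.45s sum.
print(f"ran {len(results)} models in {elapsed:.2f}s")
```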
When to use multi-model
Contact center analysis — Classify every call by product, issue, root cause, sentiment, and churn risk in one pass. Feed results into dashboards and retention workflows.
Chat/ticket triage — Route incoming tickets to the right team by combining urgency, topic, and language detection in a single request.
Quality assurance — Score agent performance across multiple dimensions (empathy, resolution, compliance) from one transcript.
Batch processing — Classify up to 8 conversations across all your models per request. For higher throughput, send requests concurrently.
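Splitting a large workload into request-sized batches is straightforward. A minimal sketch, assuming the documented limit of 8 texts per request (`chunk` and `MAX_TEXTS_PER_REQUEST` are illustrative names, not part of any official client):

```python
MAX_TEXTS_PER_REQUEST = 8

def chunk(texts, size=MAX_TEXTS_PER_REQUEST):
    """Split texts into request-sized batches, preserving order."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

transcripts = [f"transcript {i}" for i in range(20)]
batches = chunk(transcripts)
print([len(b) for b in batches])  # [8, 8, 4]
```

Each batch can then be POSTed concurrently (for example, with a thread pool) to raise overall throughput.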