labelf.ai
API Documentation
Base URL: https://api.labelf.ai/v2 (REST + JSON, Bearer auth)

Models

List your deployed models and fetch details including labels, accuracy metrics, and configuration. Models are the core of Labelf — everything from zero-shot prototypes to production fine-tuned classifiers.

GET /v2/models

List all deployed models in your workspace. Returns model IDs, names, types, and status.

GET /v2/models/{model_id}

Get full details for a specific model — labels, type, training status, and accuracy metrics.
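A minimal sketch of calling these two endpoints, assuming a standard Bearer token in the Authorization header (the helper names are our own; token handling and error checking are omitted):

```python
import json
import urllib.request

API_BASE = "https://api.labelf.ai/v2"

def build_request(path, token):
    """Build an authenticated GET request against the Labelf API."""
    return urllib.request.Request(
        API_BASE + path,
        headers={
            "Authorization": f"Bearer {token}",  # Bearer auth, per the docs
            "Accept": "application/json",
        },
    )

def get_model(model_id, token):
    """GET /v2/models/{model_id} — full details for one model."""
    with urllib.request.urlopen(build_request(f"/models/{model_id}", token)) as resp:
        return json.load(resp)

# Usage (needs a real token; makes a live HTTP call):
# model = get_model(42, token="YOUR_API_TOKEN")
# print(model["name"], model["status"])
```

The field names used in the usage comment match the response example shown later in this section.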

Model types

Labelf supports a progression of model types, from instant prototypes to production-grade classifiers. Each type builds on the previous, letting you start classifying immediately and improve accuracy as you gather data.

ZERO-SHOT

Describe and classify

Define your categories in plain text — the model starts classifying immediately with no training data. Ideal for prototyping: describe what "Billing complaint" or "Churn risk" means, deploy, and start getting predictions within minutes. Accuracy is typically 70–85% depending on task complexity.

FEW-SHOT

Active Learning

Label 50–200 examples per class and the model learns your specific domain. Labelf's Active Learning system recommends which examples to label next — it finds model weaknesses and edge cases so each labeled example has maximum impact. Accuracy typically reaches 85–93%.

FINE-TUNED

Custom model

Full custom model trained on your data. Learns domain-specific vocabulary, jargon, and patterns that generic models miss. A telecom fine-tuned model knows that "Hemma Bredband" is a product name, not a description. Highest accuracy (90–97%) and fastest inference latency.

LLM

Prompt-tuned

For generative tasks that go beyond classification: summarization, entity extraction, reasoning, and structured output. Uses large language models with custom prompts and guardrails. Ideal for extracting action items from calls, generating ticket summaries, or answering "why did the customer churn?"

Model evaluation

Every model in Labelf comes with built-in evaluation metrics. You see exactly how well your model performs, per class, before deploying to production.

  • Confusion matrix: Where the model gets confused — e.g. it mislabels "Billing" as "Cancellation" 12% of the time. Shows you exactly which categories overlap.
  • Precision: When the model says "Churn risk", how often is it right? High precision = fewer false alarms.
  • Recall: Of all actual churn-risk conversations, how many does the model catch? High recall = fewer missed cases.
  • F1 score: Harmonic mean of precision and recall. The single number that tells you overall per-class performance.
  • Confidence threshold: Tune the cutoff per model. Higher threshold = more precise but fewer predictions. Lower = broader coverage but more noise. Labelf shows how each threshold affects your metrics in real time.
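Precision, recall, and F1 are all derived from the confusion matrix. A small self-contained illustration of that arithmetic (the counts are invented, echoing the Billing/Cancellation example above — this is not Labelf code):

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    confusion[true_label][predicted_label] = count of examples.
    """
    labels = list(confusion)
    metrics = {}
    for label in labels:
        tp = confusion[label].get(label, 0)
        fn = sum(n for pred, n in confusion[label].items() if pred != label)
        fp = sum(confusion[other].get(label, 0) for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Invented counts: "Billing" gets mislabeled as "Cancellation" 12% of the time.
confusion = {
    "Billing":      {"Billing": 88, "Cancellation": 12},
    "Cancellation": {"Billing": 5,  "Cancellation": 95},
}
# Recall for "Billing" = 88 / (88 + 12) = 0.88
```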

Response example

A model object includes its configuration, label set, training type, deployment status, and accuracy metrics.

{
  "id": 42,
  "name": "Contact Reason v3",
  "type": "fine-tuned",
  "status": "deployed",
  "labels": [
    "Billing",
    "Technical",
    "Cancellation",
    "Upgrade",
    "General inquiry",
    "Complaint"
  ],
  "metrics": {
    "accuracy": 0.94,
    "f1_macro": 0.92,
    "per_class": {
      "Billing":     { "precision": 0.96, "recall": 0.93, "f1": 0.94 },
      "Technical":    { "precision": 0.91, "recall": 0.95, "f1": 0.93 },
      "Cancellation": { "precision": 0.93, "recall": 0.89, "f1": 0.91 },
      "Upgrade":      { "precision": 0.95, "recall": 0.92, "f1": 0.93 },
      "General inquiry": { "precision": 0.88, "recall": 0.91, "f1": 0.89 },
      "Complaint":    { "precision": 0.90, "recall": 0.87, "f1": 0.88 }
    }
  },
  "training_examples": 4280,
  "last_trained": "2026-03-15T09:14:00Z",
  "confidence_threshold": 0.65
}
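A model object in this shape is easy to audit programmatically — for example, flagging classes whose F1 falls below a target before promoting the model, or applying its tuned confidence threshold. The helper names below are our own illustration, not part of the API:

```python
# Abbreviated copy of the model object from the response example above.
model = {
    "metrics": {"per_class": {
        "Billing":         {"precision": 0.96, "recall": 0.93, "f1": 0.94},
        "Technical":       {"precision": 0.91, "recall": 0.95, "f1": 0.93},
        "Cancellation":    {"precision": 0.93, "recall": 0.89, "f1": 0.91},
        "Upgrade":         {"precision": 0.95, "recall": 0.92, "f1": 0.93},
        "General inquiry": {"precision": 0.88, "recall": 0.91, "f1": 0.89},
        "Complaint":       {"precision": 0.90, "recall": 0.87, "f1": 0.88},
    }},
    "confidence_threshold": 0.65,
}

def weak_classes(model, min_f1=0.90):
    """Labels whose per-class F1 is below min_f1, sorted alphabetically."""
    per_class = model["metrics"]["per_class"]
    return sorted(label for label, m in per_class.items() if m["f1"] < min_f1)

def accept_prediction(model, confidence):
    """Apply the model's tuned threshold; below it, route to human review."""
    return confidence >= model["confidence_threshold"]

# weak_classes(model) -> ["Complaint", "General inquiry"]
```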

Active Learning

Labeling data is expensive. Active Learning makes every labeled example count by recommending which examples to label next. Instead of randomly sampling from your dataset, the system:

  • Finds model weaknesses — surfaces examples where the model is least confident, targeting the decision boundaries between confusable classes
  • Samples for diversity — ensures you label examples from different clusters, not just the same type of edge case over and over
  • Surfaces rare classes — actively seeks out underrepresented categories that would otherwise take thousands of random samples to find
  • Prioritizes impact — ranks examples by expected accuracy gain so each labeling session moves the needle as much as possible

In practice, a skilled annotator can label 200–400 examples per hour using the Labelf UI. With Active Learning, 200 well-chosen examples often outperform 2,000 randomly labeled ones. This means you can go from zero-shot prototype to production-grade model in a single afternoon.
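Labelf's system combines several signals (uncertainty, diversity, rare classes, expected impact), but the "finds model weaknesses" component alone — least-confidence sampling — can be sketched in a few lines. This is an illustration of the general technique, not Labelf's actual implementation:

```python
def least_confident(predictions, k):
    """Rank unlabeled examples by model confidence (max class probability),
    lowest first, and return the k example ids most worth labeling next."""
    ranked = sorted(predictions, key=lambda item: max(item[1].values()))
    return [example_id for example_id, _probs in ranked[:k]]

# Hypothetical unlabeled pool: (example_id, {label: probability}).
predictions = [
    ("conv-101", {"Billing": 0.92, "Cancellation": 0.08}),  # model is sure
    ("conv-102", {"Billing": 0.55, "Cancellation": 0.45}),  # near the boundary
    ("conv-103", {"Billing": 0.70, "Cancellation": 0.30}),
]
# The boundary case surfaces first:
# least_confident(predictions, 2) -> ["conv-102", "conv-103"]
```

Examples near the decision boundary are exactly the ones a human label resolves most cheaply, which is why 200 well-chosen examples can beat 2,000 random ones.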

Full lifecycle API — Available on request

The read-only model API documented above is available to all customers. The full lifecycle API — programmatic model creation, training, deployment, retraining, and evaluation — is available to enterprise customers.

  • Create models via API (zero-shot, few-shot, fine-tuning, prompt-tuned)
  • Programmatic deploy and retrain
  • Active Learning recommendations
  • Confusion matrix and per-class metrics
  • Confidence threshold tuning
  • Model versioning and rollback
  • Training job webhooks
© 2026 Labelf. All rights reserved.