Understanding Model Metrics

Diving into model performance metrics and how to use them to evaluate your models

Written by Ted Tigerschiöld
Updated today

After deploying or fine-tuning a model, you need to know how well it performs. Labelf provides metrics that help you evaluate your model's accuracy, identify areas for improvement, and decide whether additional prompt tuning or fine-tuning is needed. This article explains what those metrics mean and how to interpret them.

Where to find model metrics

Model metrics are available on the Metrics tab of your model's detail page. To get there, navigate to the Models page and open the model you want to evaluate.

The Metrics tab shows four key performance gauges at the top — Accuracy, F1 Score, Precision, and Recall — followed by an interactive confusion matrix and a detailed per-label breakdown.

You can toggle between Training Set and Validation Set metrics using the radio buttons at the top. The validation set gives you the most realistic picture of how the model will perform on new, unseen conversations. There's also a Confidence Threshold slider that lets you filter predictions by confidence level, and a Refresh Metrics button to recalculate after changes.

You can also see a summary of your model's performance on the Overview tab, which displays the same four metrics with a toggle between Training and Validation views.

Key metrics explained

Accuracy

Accuracy tells you the percentage of conversations that the model classified correctly out of all conversations it was tested on. For example, an accuracy of 85% means the model got 85 out of every 100 predictions right.

Accuracy gives you a quick overall picture, but it can be misleading if your labels are imbalanced. A model that always predicts the most common label could have a high accuracy even though it never correctly identifies rare categories. That's why you should also look at per-label metrics.
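The imbalance problem above is easy to see with a small sketch. The labels and counts here are hypothetical, chosen only to illustrate the point:

```python
# Hypothetical imbalanced dataset: 90 "General" conversations, 10 "Billing Issue".
actual = ["General"] * 90 + ["Billing Issue"] * 10

# A lazy model that always predicts the majority label:
predicted = ["General"] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.0%}")  # 90% accuracy, yet it never finds a billing issue
```

The headline accuracy looks strong, but recall for "Billing Issue" is zero, which is exactly what the per-label metrics below would surface.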

Precision

Precision answers the question: "When the model predicts a specific label, how often is it correct?"

A precision of 90% for "Billing Issue" means that when the model predicts a conversation is about billing, it's correct 90% of the time. The other 10% are false positives — conversations that were labeled as billing but actually belong to a different category.

High precision means fewer false alarms. This matters when an automated workflow triggers based on the model's output, or when you're using classifications to make business decisions.

Recall

Recall answers the question: "Of all the conversations that actually belong to a label, how many did the model find?"

A recall of 80% for "Billing Issue" means the model correctly identified 80% of all actual billing conversations. The other 20% were missed — the model classified them as something else.

High recall means the model catches most of the relevant conversations. This matters when you can't afford to miss anything — for example, when monitoring for churn risk or compliance issues.
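Both definitions reduce to simple ratios over true positives, false positives, and false negatives. A minimal sketch, using hypothetical counts chosen to match the 90% precision and 80% recall figures above:

```python
def precision(tp, fp):
    # Of all predictions for this label, how many were right?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual instances of this label, how many did the model find?
    return tp / (tp + fn)

# Hypothetical counts for "Billing Issue":
# 72 correct hits, 8 false alarms, 18 missed conversations.
tp, fp, fn = 72, 8, 18
print(f"Precision: {precision(tp, fp):.0%}")  # 90%
print(f"Recall:    {recall(tp, fn):.0%}")     # 80%
```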

F1 Score

The F1 score is the harmonic mean of precision and recall. It provides a single number that balances both metrics. An F1 score of 1.0 is perfect, while 0.0 is the worst possible.

F1 is especially useful when you need a balanced view of performance and when your labels have different numbers of examples.
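Because the harmonic mean is dominated by the smaller of the two values, F1 penalizes a lopsided model much harder than a simple average would. A quick sketch of the formula:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; a large gap between the
    # two drags the score toward the lower value.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9, 0.8), 3))  # balanced inputs -> 0.847, close to both
print(round(f1_score(1.0, 0.1), 3))  # lopsided inputs -> 0.182, near the weak one
```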

Reading the confusion matrix

The confusion matrix is displayed below the performance gauges on the Metrics tab. It's an interactive table that shows how the model's predictions compare to the actual labels. Each row represents the actual (true) label, and each column represents the predicted label.

  • Diagonal cells (top-left to bottom-right) show correct predictions — these are highlighted in green

  • Off-diagonal cells show errors — where the model confused one label for another — these are highlighted in yellow/red

For example, if the cell at row "Resolved" and column "Partially Resolved" shows 15, it means the model predicted "Partially Resolved" for 15 conversations that were actually "Resolved."

You can click on any cell in the confusion matrix to view the actual conversations behind that number. This is invaluable for understanding why the model is making mistakes.

Below the confusion matrix, you'll see summary statistics: Correct Predictions, Total Predictions, Accuracy, and Classes (number of labels).

📸 SCREENSHOT: confusion-matrix.png The interactive confusion matrix showing Predicted Labels (columns) vs Actual Labels (rows). Show correct predictions on the diagonal in green and misclassifications in yellow/red. Include the summary stats below: Correct Predictions, Total Predictions, Accuracy, Classes.
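Conceptually, a confusion matrix is just a tally of (actual, predicted) pairs. This sketch uses hypothetical labels and a tiny made-up sample to show how the diagonal and off-diagonal cells arise:

```python
from collections import Counter

actual    = ["Resolved", "Resolved", "Partially Resolved", "Unresolved", "Resolved"]
predicted = ["Resolved", "Partially Resolved", "Partially Resolved", "Unresolved", "Resolved"]

# Count (actual, predicted) pairs; cells where row == column are correct.
matrix = Counter(zip(actual, predicted))

for (row, col), n in sorted(matrix.items()):
    mark = "correct" if row == col else "confused"
    print(f"actual={row!r:22} predicted={col!r:22} count={n} ({mark})")

correct = sum(n for (r, c), n in matrix.items() if r == c)
print(f"Accuracy: {correct}/{len(actual)} = {correct / len(actual):.0%}")
```

Here the single off-diagonal cell — a "Resolved" conversation predicted as "Partially Resolved" — is exactly the kind of entry you would click in the interactive matrix to inspect.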

Detailed performance metrics

Below the confusion matrix, the Detailed Performance Metrics table gives you a per-label breakdown with the following columns:

  • Label — The label name

  • Precision — How often the model is correct when it predicts this label

  • Recall — How many actual instances of this label the model finds

  • F1 Score — The balanced measure of precision and recall for this label

  • Accuracy — Overall accuracy for this label

  • Support — The number of test examples for this label

  • TP (True Positives) — Correctly predicted as this label

  • FP (False Positives) — Incorrectly predicted as this label

  • TN (True Negatives) — Correctly identified as not this label

  • FN (False Negatives) — Missed instances of this label

This table helps you spot which labels are performing well and which need attention.
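Every column in that table can be derived from the four counts at the end of it. A minimal sketch with a hypothetical "Billing" label, showing how TP, FP, TN, FN, precision, recall, and support relate:

```python
def per_label_counts(actual, predicted, label):
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # correctly predicted as label
    fp = sum(a != label and p == label for a, p in pairs)  # wrongly predicted as label
    fn = sum(a == label and p != label for a, p in pairs)  # missed instances of label
    tn = sum(a != label and p != label for a, p in pairs)  # correctly not this label
    return tp, fp, fn, tn

actual    = ["Billing", "Billing", "Other", "Other", "Billing"]
predicted = ["Billing", "Other",   "Billing", "Other", "Billing"]

tp, fp, fn, tn = per_label_counts(actual, predicted, "Billing")
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
support   = tp + fn  # number of actual "Billing" examples in the test data

print(f"TP={tp} FP={fp} FN={fn} TN={tn} support={support}")
print(f"precision={precision:.0%} recall={recall:.0%}")
```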

Using the Confidence Threshold

The Confidence Threshold slider at the top of the Metrics tab lets you see how your model performs at different levels of prediction certainty. Every time the model classifies a conversation, it assigns a confidence score — a number reflecting how sure it is about its prediction. By sliding the threshold to the right, you filter out lower-confidence predictions and only see metrics for the ones the model was most sure about.

For example, at a low threshold you see performance across all predictions. As you slide the threshold higher, the model's accuracy and precision will typically improve — because you're only looking at predictions where it was confident — but coverage drops, since more predictions fall below the cutoff and are excluded. This is useful for understanding the tradeoff between coverage and accuracy. If your workflow can tolerate leaving some conversations unclassified, setting a higher confidence threshold lets you act only on the predictions the model is most certain about, reducing errors on the ones that matter. If you need every conversation classified, keep the threshold low and focus on improving overall accuracy through prompt tuning or fine-tuning instead.
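The accuracy-versus-coverage tradeoff described above can be sketched with made-up confidence scores (the numbers here are purely illustrative):

```python
# Hypothetical predictions: each has a confidence score and whether it was correct.
predictions = [
    {"confidence": 0.95, "correct": True},
    {"confidence": 0.90, "correct": True},
    {"confidence": 0.80, "correct": True},
    {"confidence": 0.60, "correct": False},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.40, "correct": False},
]

def metrics_at(threshold, preds):
    # Keep only predictions at or above the confidence threshold.
    kept = [p for p in preds if p["confidence"] >= threshold]
    coverage = len(kept) / len(preds)
    accuracy = sum(p["correct"] for p in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

for t in (0.0, 0.5, 0.75):
    acc, cov = metrics_at(t, predictions)
    print(f"threshold={t:.2f}  accuracy={acc:.0%}  coverage={cov:.0%}")
```

Raising the threshold improves accuracy on the predictions you keep, but a growing share of conversations falls below the cutoff and goes unclassified.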

Interpreting your results

Good performance indicators

  • Overall accuracy above 80% for most business use cases

  • Balanced precision and recall across labels

  • The confusion matrix shows most values concentrated on the diagonal

  • F1 scores above 0.7 for all labels

Warning signs

  • One label dominates errors — If the model consistently confuses a specific pair of labels, those label descriptions may be too similar. Revise them to make the distinction clearer.

  • Very low recall on a label — The model is missing most conversations in this category. The label description may be too narrow, or the model needs more training examples for this label.

  • Very low precision on a label — The model is overusing this label. The description may be too broad — tighten it to reduce false positives.

  • Large gap between accuracy and F1 — This often indicates class imbalance. The model may perform well on common labels but poorly on rare ones.

What to do based on your metrics

If accuracy is low overall

  • Review your task description — is it specific enough?

  • Check whether label descriptions are detailed and unambiguous

  • Try prompt tuning: use Analyze & Optimize in Task Settings to get AI-powered suggestions, or manually refine descriptions and test against a validation set

  • If prompt tuning plateaus, consider annotating a training set for fine-tuning

If two labels are frequently confused

  • Click the relevant cell in the confusion matrix to review the misclassified conversations

  • Compare the label descriptions — is the distinction clear?

  • Add explicit guidance about what separates them (e.g., "This label applies when X, NOT when Y")

  • Consider merging the labels if the distinction isn't meaningful for your use case

If a specific label performs poorly

  • Rewrite the label description with more detail and examples

  • Add negative examples ("This does NOT include...")

  • If using fine-tuning, add more training examples for that label

  • Check the Support column — if support is very low, the label may be underrepresented in your data

Training vs. validation metrics

The Metrics tab lets you switch between Training Set and Validation Set views. Understanding the difference is important:

Training Set metrics show how well the model performs on data it was trained on. These numbers tend to be optimistic — the model has already seen these conversations.

Validation Set metrics show performance on held-out data that the model didn't see during training. These are a more realistic indicator of how the model will perform on new conversations.

If there's a large gap between training and validation metrics, the model may be overfitting — performing well on training data but struggling to generalize. This usually means you need more diverse training examples or simpler label definitions.

Monitoring performance over time

Model accuracy isn't static — it can change as your data evolves. Customer issues shift, new products launch, and the language your customers use changes over time.

Regularly review your model's predictions in Search and watch for emerging patterns that the model might not handle well. When you notice a decline in accuracy for certain categories, it's time to revisit your task description, label definitions, or training data. The prompt tuning History tab in Task Settings lets you track how your metrics have changed across iterations.

Need help?

If you need help interpreting your model metrics or developing a strategy to improve accuracy, reach out to us at support@labelf.ai or use the chat widget in the bottom-right corner of the screen.
