After deploying or fine-tuning a model, you need to know how well it performs. Labelf provides metrics that help you evaluate your model's accuracy, identify areas for improvement, and decide whether additional prompt tuning or fine-tuning is needed. This article explains what those metrics mean and how to interpret them.
Where to find model metrics
Model metrics are available on the Metrics tab of your model's detail page. To get there, navigate to the Models page and open the model you want to evaluate.
The Metrics tab shows four key performance gauges at the top — Accuracy, F1 Score, Precision, and Recall — followed by an interactive confusion matrix and a detailed per-label breakdown.
You can toggle between Training Set and Validation Set metrics using the radio buttons at the top. The validation set gives you the most realistic picture of how the model will perform on new, unseen conversations. There's also a Confidence Threshold slider that lets you filter predictions by confidence level, and a Refresh Metrics button to recalculate after changes.
You can also see a summary of your model's performance on the Overview tab, which displays the same four metrics with a toggle between Training and Validation views.
Key metrics explained
Accuracy
Accuracy tells you the percentage of conversations that the model classified correctly out of all conversations it was tested on. For example, an accuracy of 85% means the model got 85 out of every 100 predictions right.
Accuracy gives you a quick overall picture, but it can be misleading if your labels are imbalanced. A model that always predicts the most common label could have a high accuracy even though it never correctly identifies rare categories. That's why you should also look at per-label metrics.
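The imbalance pitfall is easy to see with a toy example (the labels and counts below are hypothetical, not Labelf output):

```python
# Hypothetical dataset: 90 "General" conversations and 10 "Billing Issue" ones.
actual = ["General"] * 90 + ["Billing Issue"] * 10

# A degenerate model that always predicts the majority label.
predicted = ["General"] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.0%}")  # 90%, yet not a single Billing Issue was found
```

Despite 90% accuracy, this model is useless for finding billing issues, which is exactly what the per-label precision and recall metrics below would reveal.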
Precision
Precision answers the question: "When the model predicts a specific label, how often is it correct?"
A precision of 90% for "Billing Issue" means that when the model predicts a conversation is about billing, it's correct 90% of the time. The other 10% are false positives — conversations that were labeled as billing but actually belong to a different category.
High precision means fewer false alarms. This matters when an automated workflow triggers based on the model's output, or when you're using classifications to make business decisions.
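In code, precision for a single label reduces to counting true and false positives. A minimal sketch, with hypothetical counts chosen to match the 90% example above:

```python
def precision(actual, predicted, label):
    """Of all the times the model predicted `label`, what fraction were right?"""
    tp = sum(1 for a, p in zip(actual, predicted) if p == label and a == label)
    fp = sum(1 for a, p in zip(actual, predicted) if p == label and a != label)
    return tp / (tp + fp) if (tp + fp) else 0.0

# 10 predictions of "Billing Issue": 9 correct, 1 false positive.
actual    = ["Billing Issue"] * 9 + ["Shipping"]
predicted = ["Billing Issue"] * 10

print(precision(actual, predicted, "Billing Issue"))  # 0.9
```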
Recall
Recall answers the question: "Of all the conversations that actually belong to a label, how many did the model find?"
A recall of 80% for "Billing Issue" means the model correctly identified 80% of all actual billing conversations. The other 20% were missed — the model classified them as something else.
High recall means the model catches most of the relevant conversations. This matters when you can't afford to miss anything — for example, when monitoring for churn risk or compliance issues.
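Recall swaps the false positives for false negatives, the instances the model missed. A minimal sketch, with hypothetical counts matching the 80% example above:

```python
def recall(actual, predicted, label):
    """Of all conversations that truly belong to `label`, what fraction were found?"""
    tp = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    fn = sum(1 for a, p in zip(actual, predicted) if a == label and p != label)
    return tp / (tp + fn) if (tp + fn) else 0.0

# 10 actual "Billing Issue" conversations: the model found 8 and missed 2.
actual    = ["Billing Issue"] * 10
predicted = ["Billing Issue"] * 8 + ["Shipping", "General"]

print(recall(actual, predicted, "Billing Issue"))  # 0.8
```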
F1 Score
The F1 score is the harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall). It provides a single number that balances both metrics. An F1 score of 1.0 is perfect, while 0.0 is the worst possible. Because the harmonic mean is pulled toward the lower of the two values, a model can't earn a high F1 score by excelling at one metric while neglecting the other.
F1 is especially useful when you need a balanced view of performance and when your labels have different numbers of examples.
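The formula is short enough to sketch directly; the 0.9 and 0.8 inputs below correspond to the precision and recall figures from the Billing Issue examples above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9, 0.8), 3))  # 0.847
# The harmonic mean punishes imbalance: perfect precision with poor recall
# still yields a low F1 (an arithmetic mean would misleadingly give 0.55).
print(round(f1_score(1.0, 0.1), 3))  # 0.182
```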
Reading the confusion matrix
The confusion matrix is displayed below the performance gauges on the Metrics tab. It's an interactive table that shows how the model's predictions compare to the actual labels. Each row represents the actual (true) label, and each column represents the predicted label.
Diagonal cells (top-left to bottom-right) show correct predictions — these are highlighted in green
Off-diagonal cells show errors — where the model confused one label for another — these are highlighted in yellow/red
For example, if the cell at row "Resolved" and column "Partially Resolved" shows 15, it means the model predicted "Partially Resolved" for 15 conversations that were actually "Resolved."
You can click on any cell in the confusion matrix to view the actual conversations behind that number. This is invaluable for understanding why the model is making mistakes.
Below the confusion matrix, you'll see summary statistics: Correct Predictions, Total Predictions, Accuracy, and Classes (number of labels).
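Conceptually, a confusion matrix is nothing more than a table of (actual, predicted) counts. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each actual label was predicted as each label."""
    return Counter(zip(actual, predicted))

actual    = ["Resolved", "Resolved", "Resolved", "Partially Resolved"]
predicted = ["Resolved", "Resolved", "Partially Resolved", "Partially Resolved"]

matrix = confusion_matrix(actual, predicted)
# Diagonal cell: correct predictions for "Resolved".
print(matrix[("Resolved", "Resolved")])            # 2
# Off-diagonal cell: "Resolved" conversations mistaken for "Partially Resolved".
print(matrix[("Resolved", "Partially Resolved")])  # 1
```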
📸 SCREENSHOT: confusion-matrix.png The interactive confusion matrix showing Predicted Labels (columns) vs Actual Labels (rows). Show correct predictions on the diagonal in green and misclassifications in yellow/red. Include the summary stats below: Correct Predictions, Total Predictions, Accuracy, Classes.
Detailed performance metrics
Below the confusion matrix, the Detailed Performance Metrics table gives you a per-label breakdown with the following columns:
Label — The label name
Precision — How often the model is correct when it predicts this label
Recall — How many actual instances of this label the model finds
F1 Score — The balanced measure of precision and recall for this label
Accuracy — Accuracy for this label treated as a binary decision (this label vs. not this label)
Support — The number of test examples for this label
TP (True Positives) — Correctly predicted as this label
FP (False Positives) — Incorrectly predicted as this label
TN (True Negatives) — Correctly identified as not this label
FN (False Negatives) — Missed instances of this label
This table helps you spot which labels are performing well and which need attention.
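The TP/FP/TN/FN columns can all be derived by treating each label as a binary decision. A minimal sketch of the per-label counts (label names are hypothetical):

```python
def per_label_counts(actual, predicted, label):
    """Binary (this label vs. not this label) counts for one label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == label and a == label:
            tp += 1  # correctly predicted as this label
        elif p == label:
            fp += 1  # incorrectly predicted as this label
        elif a == label:
            fn += 1  # missed instance of this label
        else:
            tn += 1  # correctly identified as not this label
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "Support": tp + fn}

actual    = ["Billing", "Billing", "Shipping", "General"]
predicted = ["Billing", "Shipping", "Billing", "General"]
print(per_label_counts(actual, predicted, "Billing"))
# {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1, 'Support': 2}
```

Note that Support counts actual instances (TP + FN), which is why a label with low Support can show unstable precision and recall.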
Using the Confidence Threshold
The Confidence Threshold slider at the top of the Metrics tab lets you see how your model performs at different levels of prediction certainty. Every time the model classifies a conversation, it assigns a confidence score — a number reflecting how sure it is about its prediction. By sliding the threshold to the right, you filter out lower-confidence predictions and only see metrics for the ones the model was most sure about.
For example, at a low threshold you see performance across all predictions. As you slide the threshold higher, the model's accuracy and precision will typically improve — because you're only looking at predictions where it was confident — but coverage drops, since more predictions fall below the cutoff and are excluded. This is useful for understanding the tradeoff between coverage and accuracy.

If your workflow can tolerate leaving some conversations unclassified, setting a higher confidence threshold lets you act only on the predictions the model is most certain about, reducing errors on the ones that matter. If you need every conversation classified, keep the threshold low and focus on improving overall accuracy through prompt tuning or fine-tuning instead.
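The coverage-versus-accuracy tradeoff can be sketched with hypothetical (predicted, actual, confidence) triples:

```python
# Hypothetical predictions: (predicted_label, actual_label, confidence).
predictions = [
    ("Billing",  "Billing",  0.95),
    ("Billing",  "Shipping", 0.55),  # a low-confidence mistake
    ("Shipping", "Shipping", 0.88),
    ("Billing",  "Billing",  0.62),
]

def metrics_at_threshold(predictions, threshold):
    """Coverage and accuracy, counting only predictions at or above the threshold."""
    kept = [(p, a) for p, a, c in predictions if c >= threshold]
    coverage = len(kept) / len(predictions)
    accuracy = sum(p == a for p, a in kept) / len(kept) if kept else None
    return coverage, accuracy

print(metrics_at_threshold(predictions, 0.0))  # (1.0, 0.75): everything kept, one error
print(metrics_at_threshold(predictions, 0.6))  # (0.75, 1.0): fewer kept, no errors
```

In this toy data the mistake happens to be the lowest-confidence prediction, which is the pattern a well-calibrated model should show: raising the threshold trades coverage for accuracy.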
Interpreting your results
Good performance indicators
Overall accuracy above 80% for most business use cases
Balanced precision and recall across labels
The confusion matrix shows most values concentrated on the diagonal
F1 scores above 0.7 for all labels
Warning signs
A specific pair of labels dominates errors — If the model consistently confuses two particular labels, their descriptions may be too similar. Revise them to make the distinction clearer.
Very low recall on a label — The model is missing most conversations in this category. The label description may be too narrow, or the model needs more training examples for this label.
Very low precision on a label — The model is overusing this label. The description may be too broad — tighten it to reduce false positives.
Large gap between accuracy and F1 — This often indicates class imbalance. The model may perform well on common labels but poorly on rare ones.
What to do based on your metrics
If accuracy is low overall
Review your task description — is it specific enough?
Check whether label descriptions are detailed and unambiguous
Try prompt tuning: use Analyze & Optimize in Task Settings to get AI-powered suggestions, or manually refine descriptions and test against a validation set
If prompt tuning plateaus, consider annotating a training set for fine-tuning
If two labels are frequently confused
Click the relevant cell in the confusion matrix to review the misclassified conversations
Compare the label descriptions — is the distinction clear?
Add explicit guidance about what separates them (e.g., "This label applies when X, NOT when Y")
Consider merging the labels if the distinction isn't meaningful for your use case
If a specific label performs poorly
Rewrite the label description with more detail and examples
Add negative examples ("This does NOT include...")
If using fine-tuning, add more training examples for that label
Check the Support column — if support is very low, the label may be underrepresented in your data
Training vs. validation metrics
The Metrics tab lets you switch between Training Set and Validation Set views. Understanding the difference is important:
Training Set metrics show how well the model performs on data it was trained on. These numbers tend to be optimistic — the model has already seen these conversations.
Validation Set metrics show performance on held-out data that the model didn't see during training. These are a more realistic indicator of how the model will perform on new conversations.
If there's a large gap between training and validation metrics, the model may be overfitting — performing well on training data but struggling to generalize. This usually means you need more diverse training examples or simpler label definitions.
Monitoring performance over time
Model accuracy isn't static — it can change as your data evolves. Customer issues shift, new products launch, and the language your customers use changes over time.
Regularly review your model's predictions in Search and watch for emerging patterns that the model might not handle well. When you notice a decline in accuracy for certain categories, it's time to revisit your task description, label definitions, or training data. The prompt tuning History tab in Task Settings lets you track how your metrics have changed across iterations.
Next steps
Deploying Your Model — Understand the deployment lifecycle and using predictions
Improving Your Model — Go back to prompt tuning or fine-tuning to boost accuracy
Tips for Building Better Classifiers — Strategies for maximizing model accuracy
Need help?
If you need help interpreting your model metrics or developing a strategy to improve accuracy, reach out to us at support@labelf.ai or use the chat widget in the bottom-right corner of the screen.


