Improving Your Model

Everything you need to improve your classification model to fit your needs

Written by Ted Tigerschiöld
Updated today

After creating a classification model, you can deploy it immediately based on the task description and label definitions you provided during setup. If you want to boost accuracy, Labelf offers two optional improvement paths: prompt tuning and annotation-based fine-tuning. This article explains all three options, from instructions-only deployment to full fine-tuning.

The three paths to deployment

When you create a model, you describe your classification task and define labels using natural language. That alone is enough to get a working model. From there, you can choose how much further to invest in accuracy:

Path 1: Deploy from instructions — Use the model as-is, based on your task description and label definitions. This is the fastest path and works well when your labels are clear-cut and the classification task is straightforward.

Path 2: Prompt tuning — Refine your task description and label definitions by testing them against a small set of manually validated examples. Labelf includes an AI-powered Analyze & Optimize feature that can suggest improvements automatically. This iterative process helps you tighten the instructions without requiring a large training dataset.

Path 3: Full fine-tuning — Annotate a training set of conversations with the correct labels, then fine-tune the model on that data. This typically produces the highest accuracy but requires the most effort.

You can move through these paths incrementally — start with instructions-only, then add prompt tuning, then fine-tune if needed. Each step is optional.

Path 1: Deploying from instructions

If your classification task is well-defined and your label descriptions are detailed, the model can perform well based on instructions alone. The key is writing high-quality descriptions:

  • Task description — Be specific about what you want the model to evaluate. "Determine whether the customer's issue was resolved" is better than "Classify the conversation."

  • Label descriptions — Explain each label in enough detail that someone unfamiliar with your business could apply it correctly. Include examples of what qualifies and what doesn't.

After deploying, review the model's predictions in Search to see how it's performing. If accuracy is good enough for your needs, you're done. If not, move on to prompt tuning.

Path 2: Prompt tuning

Prompt tuning is the process of iterating on your task description and label definitions to improve how the model classifies conversations. It's like editing a set of instructions until they produce the results you want.

The prompt tuning workflow

  1. Build a validation set — Click Start Labeling on your model's Overview page and begin annotating conversations. Labelf automatically reserves a portion of your labeled data as a validation set, so you don't need to configure anything — just start labeling. Once you have enough validated items, the model can measure its predictions against your labels.

  2. Review your model's performance — Open the Metrics tab on your model page to see how the model's predictions compare to your manual labels. Check accuracy, precision, recall, and the confusion matrix to identify where the model struggles.

  3. Refine your descriptions — Go to the Task Settings tab, then the Configuration sub-tab. Here you can edit your task description and label definitions directly. For example, if the model confuses two labels, add more detail to those label descriptions explaining the distinction.

  4. Use Analyze & Optimize — Instead of manually rewriting everything, click the Analyze & Optimize button to let AI suggest improvements based on your model's current performance (see below for details).

  5. Save and re-evaluate — After updating your descriptions, click Save. The model re-evaluates its predictions on the labeled items using the new instructions. Check the Metrics tab again to see if accuracy improved.

  6. Repeat — Continue this cycle until you're satisfied with the results. Each iteration is saved in the History sub-tab so you can compare versions.
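The metrics you review in step 2 are standard classification measures, so it can help to see how they are derived from your validated labels. A minimal sketch (the label names and counts below are made up for illustration; Labelf computes these for you on the Metrics tab):

```python
from collections import Counter

# Hypothetical validation data: (true label, predicted label) pairs
pairs = [
    ("resolved", "resolved"),
    ("resolved", "unresolved"),
    ("unresolved", "unresolved"),
    ("unresolved", "resolved"),
    ("resolved", "resolved"),
]

confusion = Counter(pairs)  # (true, predicted) -> count: the confusion matrix

# Accuracy: share of items where prediction matches the manual label
accuracy = sum(n for (t, p), n in confusion.items() if t == p) / len(pairs)

def precision(label):
    """Of the items predicted as `label`, how many were truly `label`?"""
    predicted = sum(n for (t, p), n in confusion.items() if p == label)
    return confusion[(label, label)] / predicted if predicted else 0.0

def recall(label):
    """Of the items truly `label`, how many did the model catch?"""
    actual = sum(n for (t, p), n in confusion.items() if t == label)
    return confusion[(label, label)] / actual if actual else 0.0

def f1(label):
    """Harmonic mean of precision and recall."""
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r) if p + r else 0.0

print(accuracy)  # 0.6
```

Reading the numbers this way tells you what to fix: low precision on a label means the model over-applies it (narrow the description), low recall means it misses real cases (broaden it).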

Analyze & Optimize

Labelf includes an AI-powered optimization feature that can automatically suggest improvements to your task description and label definitions. You'll find it in the Task Settings tab under the AI-Powered Task Optimization section.

When you click Analyze & Optimize, the system runs through several steps:

  • Analyzing your model's current performance metrics

  • Generating optimization suggestions based on where the model struggles

  • Refining the task description to clarify classification criteria

  • Adding deep analysis and examples for common misclassifications

Once the analysis is complete, Labelf presents Optimization Suggestions for each component of your model — the task description and each label definition. For every suggestion, you see three things:

  • Current — Your existing description

  • Proposed — The AI's suggested improvement

  • Reasoning — An explanation of why the change was suggested, often referencing specific performance issues (for example, "Low precision suggests the model is over-tagging general dissatisfaction")

You can Apply individual suggestions, Edit them before applying, or click Apply All Suggestions to accept everything at once. If you don't want to use the suggestions, click Dismiss.

This feature is especially powerful when you're not sure why the model is making certain mistakes — the AI analyzes the confusion patterns and proposes targeted fixes.

Tracking your prompt tuning history

Every time you save changes to your task description or label definitions, the update is recorded in the History sub-tab within Task Settings. Each version shows:

  • The date and time of the change

  • A "Prompt tuned" tag

  • A snippet of the task description used

  • The label names at that point

  • Validation metrics (accuracy, F1 score, precision, recall)

This history lets you compare how different versions of your descriptions performed, so you can see whether your changes are improving the model. The current active version is highlighted with a Current version badge.

You can switch between List and Cards view, filter by metric set (Training or Validation), and click View Details to inspect any previous version.

Tips for effective prompt tuning

  • Focus on the label pairs that the model confuses most often — the confusion matrix on the Metrics tab shows you exactly which ones

  • Add negative examples to label descriptions ("This label does NOT apply when...")

  • Be explicit about edge cases in the task description

  • If a label catches too many false positives, narrow its description; if it misses too many true positives, broaden it

  • Use Analyze & Optimize as a starting point, then manually refine the suggestions further
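The first tip, finding the label pairs the model confuses most, amounts to ranking the off-diagonal cells of the confusion matrix. A sketch with made-up counts (the Metrics tab shows you this directly; the label names here are hypothetical):

```python
from collections import Counter

# Hypothetical confusion counts: (true label, predicted label) -> count
confusion = Counter({
    ("billing", "billing"): 40,
    ("billing", "complaint"): 12,   # billing often mistagged as complaint
    ("complaint", "billing"): 9,
    ("complaint", "complaint"): 35,
    ("other", "other"): 50,
    ("other", "complaint"): 2,
})

# Off-diagonal cells are misclassifications; sort them by frequency
confused = sorted(
    ((pair, n) for pair, n in confusion.items() if pair[0] != pair[1]),
    key=lambda item: -item[1],
)
for (true_label, predicted), n in confused:
    print(f"{true_label} -> {predicted}: {n}")
```

In this example the billing/complaint pair dominates the errors, so that is where a sharper distinction in the two label descriptions would pay off first.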

Path 3: Annotation and fine-tuning

For the highest accuracy, you can annotate a dedicated training set and fine-tune the model on your labeled data. Fine-tuning teaches the model the specific patterns in your conversations, going beyond what instructions alone can capture.

Building a training set

To fine-tune, you need a training set — a collection of conversations that have been manually labeled with the correct classification. The more diverse and representative this set is, the better the fine-tuned model will perform.

When building your training set:

  • Include examples for every label, including edge cases and ambiguous conversations

  • Cover different conversation styles, lengths, channels, and time periods

  • Aim for a balanced distribution across labels when possible

  • Label consistently — if working with a team, make sure everyone agrees on the criteria
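A quick way to check the "balanced distribution" point is to count how often each label appears in your annotations. A minimal sketch, assuming you have your labeled items as a simple list (the data and the 50%-of-even-split threshold are illustrative choices, not a Labelf rule):

```python
from collections import Counter

# Hypothetical labeled training items: (conversation id, label)
training_set = [
    (1, "resolved"), (2, "resolved"), (3, "resolved"),
    (4, "resolved"), (5, "resolved"),
    (6, "unresolved"), (7, "unresolved"),
    (8, "escalated"),
]

counts = Counter(label for _, label in training_set)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")

# Flag labels far below an even split (here: under half the even share)
threshold = 0.5 * total / len(counts)
scarce = [label for label, n in counts.items() if n < threshold]
print("needs more examples:", scarce)
```

Here "escalated" falls below the threshold, signaling that you should target more of those conversations, for instance via the Search feature in the labeling view.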

You can track your labeling progress on the model's Overview page, which shows the total number of labeled items, the training progress percentage, and your personal labeling activity.

The labeling view

Click Start Labeling on your model's Overview page to open the labeling view. This is the primary place to annotate conversations — it's a purpose-built interface designed to make labeling fast and accurate.

For each conversation, the labeling view shows:

  • Summary — An AI-generated summary of the conversation, so you can quickly understand what happened without reading the full transcript

  • Reasoning — A numbered list of AI-generated reasoning points explaining why the model classified the conversation the way it did. Each point has a View why button that opens the actual conversation messages supporting that reasoning

  • Show Conversation — A toggle to expand the full conversation transcript when you need more context

  • Label panel — The available labels with their full descriptions on the right side. Select a label and click Submit Label, or use keyboard shortcuts (1, 2, etc.) for speed. Click Skip to move on without labeling

The toolbar provides several useful features:

  • Search — Find specific conversations directly within the labeling view, so you can target particular topics or edge cases without leaving the annotation flow

  • Navigation — Use the arrow buttons to move between conversations, or click New to load a fresh conversation

  • Undo — Reverse your last labeling action if you make a mistake

  • Filter — Narrow down which conversations appear in the labeling queue

You can switch between single and random modes to control how conversations are presented — single mode works through them sequentially, while random mode picks conversations at random for a more diverse labeling experience.

You can also review all previously labeled items in the Labeling History tab on the model page. This tab shows each labeled conversation with its text preview, assigned label, and actions. You can change labels from this view if you spot a mistake, and use the Filter by label dropdown to focus on specific categories.

Fine-tuning the model

Once you have a labeled training set, the model can be fine-tuned on that data. Fine-tuning retrains the underlying language model to learn the specific patterns in your conversations, which typically produces a meaningful accuracy improvement over instructions alone.

Choosing the right approach

The right path depends on your use case and how much accuracy you need:

Deploy from instructions works best when your labels are clear-cut and you need results quickly. Good for initial exploration and low-stakes classification tasks.

Prompt tuning is the sweet spot for most teams. It requires minimal annotation effort but can significantly improve accuracy. Start here if your initial deployment shows room for improvement. The Analyze & Optimize feature makes this process faster by suggesting targeted improvements.

Fine-tuning delivers the best results for complex classification tasks where the distinction between labels is subtle or domain-specific. Worth the investment when accuracy directly impacts business decisions.

Iterating over time

Model improvement isn't a one-time effort. As your data evolves — new products, changing customer behavior, seasonal patterns — revisit your model periodically:

  • Review predictions in Search to spot emerging accuracy issues

  • Use Analyze & Optimize to get fresh suggestions based on current performance

  • Update your task description and label definitions to reflect changes

  • Add new training examples for patterns the model hasn't seen before

  • Retrain when you notice performance declining for specific categories

You can monitor changes over time in the Activity tab on your model page, which shows a timeline of status changes, renames, and other events.

Need help?

If you need guidance on which improvement path to take or help with prompt tuning strategy, reach out to us at support@labelf.ai or use the chat widget in the bottom-right corner of the screen.