What makes for a good dataset?

Let us take a closer look at what makes a dataset good for training AI models

Written by Ted Tigerschiöld
Updated over 2 years ago

To create an AI model with Labelf, you need a dataset to start with. Labelf can import data in CSV and XLSX formats, and you can also integrate your data directly from Zendesk.

The model needs data to train on. Training a text classifier means showing it labeled data: texts paired with the labels they should receive. For the model to become as good as possible, you need the best data you can get hold of. Three factors make a dataset good:

  • Relevance

  • Size

  • Accuracy in labeling

Relevance

By relevance we mean how similar the dataset is to the live data you will use the model on. If you are training a model to classify the sentiment of your Twitter feed, it is best to train it on your own Twitter data.

Size

When it comes to training data, it is good to get as much data as possible while still making sure that the data is relevant. More data is better because it gives the model more examples of how the same thing can be expressed, which makes it more likely to capture the nuances in different texts.
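
One way to get a feel for the size of a labeled dataset before importing it is to count how many examples each label has. The snippet below is a minimal sketch, not part of Labelf: the function names, the example rows, and the `min_examples` threshold are all assumptions made for illustration, and the right threshold depends on your own data.

```python
from collections import Counter


def label_counts(rows):
    """Count how many examples each label has in (text, label) pairs."""
    return Counter(label for _text, label in rows)


def underrepresented(rows, min_examples):
    """Return the labels that have fewer than min_examples examples."""
    counts = label_counts(rows)
    return sorted(label for label, n in counts.items() if n < min_examples)


# Hypothetical example rows; in practice these would come from your CSV or XLSX file.
rows = [
    ("Love the new update", "positive"),
    ("Great support, thanks", "positive"),
    ("App keeps crashing", "negative"),
]

print(label_counts(rows))
print(underrepresented(rows, min_examples=2))
```

Labels that show up in the second list are candidates for collecting more relevant examples.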

Accuracy in labeling

The model is trained to learn the connection between the labels and the texts associated with them. This means that inconsistently labeled data will hinder the model from learning how to label correctly. If you have a labeled dataset but are unsure about the accuracy of the labels, it can be better to label the dataset again on the Labelf platform.
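
A simple sign of inconsistent labeling is the same text appearing with different labels. The sketch below shows one way to look for that in a list of (text, label) pairs. It is an illustration only and not a Labelf feature: the function name and the example rows are hypothetical, and it only catches texts that are identical after lowercasing and collapsing whitespace.

```python
from collections import defaultdict


def conflicting_labels(rows):
    """Find texts that appear with more than one label.

    Texts are compared after lowercasing and collapsing whitespace,
    so only exact duplicates (up to case and spacing) are detected.
    """
    labels_by_text = defaultdict(set)
    for text, label in rows:
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical example rows.
rows = [
    ("Where is my order?", "shipping"),
    ("where is my  order?", "billing"),
    ("I was charged twice", "billing"),
]

print(conflicting_labels(rows))
```

Any text this reports is worth reviewing by hand, since at least one of its labels is likely wrong.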
