I was reading about active learning recently, and it seems to be applied only after a first model has been trained. So I was wondering whether there are techniques for choosing what to label before that first model exists. I have a text dataset with more than 1 million sentences and want to build a binary classifier on it, but I cannot label all of it. Is there a way to select a smart sample out of this 1 million to label for the first model?
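To make "smart sample" concrete, this is roughly the kind of thing I had in mind (only a sketch, assuming TF-IDF features and scikit-learn's MiniBatchKMeans; the sentence list is a tiny placeholder for my real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Tiny placeholder list standing in for my ~1M complaint sentences.
sentences = [
    "the app crashes every time I try to log in",
    "I was charged twice for the same order",
    "delivery took three weeks which is unacceptable",
    "love the product but the battery dies very fast",
    "customer support never answered my emails",
]

n_seeds = min(1000, len(sentences))  # how many sentences I can afford to label by hand

# Plain TF-IDF features, since there is no model yet at this stage.
vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
X = vectorizer.fit_transform(sentences)

# Cluster the whole corpus and take the sentence closest to each centroid,
# so the seed batch spreads over different regions of the data.
km = MiniBatchKMeans(n_clusters=n_seeds, random_state=0).fit(X)
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
seed_batch = [sentences[i] for i in closest]
```

The hope is that the sentences nearest each centroid cover diverse regions of the data, so the first labelled batch is less redundant than a purely random draw. I do not know whether this is a sensible way to do it, which is partly what I am asking.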
EDIT 1: Answers to the questions in the comments.
What are the sentences (data)?
They are user complaints about a product, written in informal language.
What are the categories you hope to predict? Is there good reason to suspect the classes are balanced, or will 1 class be much more prevalent?
There are two categories; it is a topic-labelling problem. If a sentence is related to the topic we label it 1, otherwise 0. The dataset is very imbalanced: there will definitely be far more 0s than 1s.
If you were to pick a sentence at random & read it, would it be obvious which class it belongs to, or ambiguous?
Ambiguous
Would it be a subjective judgement?
Yes
Do you have any sense, prior to building the model, what attributes are likely to be associated with the different classes?
I do: we selected some keywords that are probably associated with class 1. I do not have anything similar for class 0. Class 0 actually covers many different topics, but I do not care about them; I only want to separate one topic from the rest.
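In case it clarifies that last point, this is roughly how I imagined using the keyword list to bias the first labelling batch toward likely class-1 sentences while still covering class 0 (the keywords here are made up, and it reuses the `sentences` placeholder from the sketch above):

```python
import random
import re

# Made-up keyword list standing in for the ones we actually selected for class 1.
keywords = ["charged", "refund", "billing"]
pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)

hits = [s for s in sentences if pattern.search(s)]        # likely class-1 candidates
misses = [s for s in sentences if not pattern.search(s)]  # mostly class 0, many topics

# Label the keyword hits (up to a cap) plus a random slice of the rest,
# so the very rare class 1 is not drowned out in the first batch.
random.seed(0)
to_label = hits[:500] + random.sample(misses, min(500, len(misses)))
```

My worry is that this biases the seed set toward the keywords I already know about, which is why I am asking whether there is a more principled way to pick the first sample.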