I am working with a data set of fake job postings, and it has the following columns:
data.columns
Out[18]:
Index(['title', 'location', 'description', 'requirements', 'telecommuting',
'has_company_logo', 'has_questions', 'fraudulent', 'title_tokenized',
'description_tokenized', 'requirements_tokenized'],
dtype='object')
The issue is:
pos_instances = data[data['fraudulent']==1].shape[0]
neg_instances = data[data['fraudulent']==0].shape[0]
print('There are {} data points for positive class, and {} data points for the negative class.'.format(pos_instances,neg_instances))
print('The ratio of positive class to negative class is {}.'.format(round(pos_instances/neg_instances,2)))
print('The data is highly imbalanced.')
del pos_instances, neg_instances
There are 705 data points for positive class, and 14310 data points for the negative class.
The ratio of positive class to negative class is 0.05.
The data is highly imbalanced.
Imputation is not viable because the data is textual; I cannot impute a fake job posting.
Any ideas for dealing with this issue are welcome. At present, the only solution I can see is to under-sample the negative class.
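For what it's worth, under-sampling the negative class can be done directly in pandas with `DataFrame.sample`. Below is a minimal sketch; the toy `data` frame here is a stand-in I constructed for illustration (the real data set would be loaded elsewhere), and only the `fraudulent` column matters for the resampling logic:

```python
import pandas as pd

# Toy stand-in for the real DataFrame; only 'fraudulent' matters here.
data = pd.DataFrame({
    'description': ['posting {}'.format(i) for i in range(100)],
    'fraudulent': [1] * 5 + [0] * 95,
})

pos = data[data['fraudulent'] == 1]
neg = data[data['fraudulent'] == 0]

# Randomly under-sample the majority (negative) class down to the minority size.
neg_down = neg.sample(n=len(pos), random_state=42)

# Recombine and shuffle so the classes are interleaved.
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=42)

print(balanced['fraudulent'].value_counts())
```

Note that random under-sampling discards most of the negative examples, so it is usually worth repeating over several random draws or comparing against alternatives such as class weighting before committing to it.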
Thanks for sharing the link to your answer. It was an informative read. I personally never knew that proper scoring rules like Brier existed. So, I'm really thankful to you for nudging me in the right direction. However, there are a few questions I have in my mind: Does the ratio of each class not affect the equation of decision boundary irrespective of the scoring rule? Even if we are training the models using a proper scoring rule such as Brier, does the ratio of representation of each class not matter? More importantly, do decision boundaries even truly exist? – rxp3292 Aug 30 '20 at 07:43