How limited is my dataset?

Question

I have multiple datasets, all of which are histograms, and they cannot be linked. The actual topic of the data is quite complex, so as an analogy, imagine I am surveying different qualities of men and women. The data I have is equivalent to spending one day running around the town centre surveying men, and the next day running around the town centre surveying women. I ask each person only one question, such as their height, shoe size or favourite colour. I don't record who gave which answer, so I cannot say what a given person's height AND shoe size are; I only know their gender and their answer to one question. Thus I have only 6 datasets, 3 for each gender. Each dataset has three columns: the gender, the answer, and the number of times that answer was given, i.e. histogram data.

I am trying to classify people as male or female based on these three questions, but again I can ask any new person only one question. I currently have three histograms, one for each question, and each histogram has two plots: one for men, one for women. All I can think to do so far is make a cut where the two histograms intersect and classify any new person as male or female based on which side of the cut they fall on (so if a new person says their height is 5 foot 2, they would be classified as a woman because that is more likely). Are there any other classification techniques I can use on a dataset like this, or is this really the only method?

To make it clearer, my data is in the format pictured below. As you can see, the data is not linked, so I don't know more than one piece of information about any individual.
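The cut described in the question (the point where the two histograms intersect) can be sketched in a few lines. This is only an illustration: the function name `crossing_points`, the bin centres and all counts below are made up, and the histograms are normalised to frequencies first, which implicitly assumes both groups are equally common among new people.

```python
def crossing_points(bin_centres, counts_a, counts_b):
    """Return midpoints between adjacent bins where the normalised
    histograms of group A and group B swap order.

    Hypothetical helper: assumes ordered numeric bins that are shared
    by both histograms.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    # Compare relative frequencies, not raw counts, so that unequal
    # sample sizes do not shift the cut.
    diffs = [a / total_a - b / total_b for a, b in zip(counts_a, counts_b)]

    cuts = []
    prev_x, prev_d = None, 0.0
    for x, d in zip(bin_centres, diffs):
        if d == 0:
            continue  # tie: neither group is ahead in this bin
        if prev_d != 0 and (d > 0) != (prev_d > 0):
            cuts.append((prev_x + x) / 2)
        prev_x, prev_d = x, d
    return cuts


# Made-up height histograms (bin centres in cm), for illustration only.
centres = [150, 160, 170, 180, 190]
women = [10, 40, 35, 12, 3]
men = [2, 15, 38, 35, 10]

cuts = crossing_points(centres, women, men)
print(cuts)
```

If the histograms cross more than once, the function returns several cuts, in which case a single threshold no longer describes the "whichever histogram is taller" rule.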

It would be better to use numerical attributes like height or shoe size "as-is", rather than bin them. Essentially, it sounds like you want to build three naive Bayes classifiers, one per question. This will give you a posterior probability of a respondent being male or female. Thresholding is a separate topic. — Stephan Kolassa, Jul 25 '22 at 14:15
@StephanKolassa The issue is that I receive the data in the histogram format described above, i.e. the observable, such as height, and then the number of times that height was given as the answer. In reality this is data from particle physics, and receiving the data as a histogram is a limitation of how particle detectors can record the data. — sputnik44, Jul 25 '22 at 14:22
OK. A naive Bayes classifier still looks like the tool of choice. — Stephan Kolassa, Jul 25 '22 at 14:30
I don't see a problem. You have the heights and corresponding genders; model that and use your new heights to predict the gender. Ditto for shoe size and favorite color. Whatever information you have, use that to make your prediction. // @StephanKolassa Why naïve Bayes over a logistic regression, random forest, or neural network? — Dave, Jul 25 '22 at 14:30
@Dave: with aggregated and tabulated input data, I would go for naive Bayes... — Stephan Kolassa, Jul 25 '22 at 14:33
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community, Jul 25 '22 at 15:15
@StephanKolassa Does naive Bayes not require the data to be joined, in the sense that for each person surveyed I would need to know their height, shoe size and favourite colour? In my case I know only one of these things about each person, so given some new data I cannot go through and multiply the likelihoods for each observable together, because I only have information about one observable. — sputnik44, Jul 25 '22 at 20:40
@Dave The issue is that the data isn't joined, so if I did what you said it would be for each table (in the original post) separately. For naive Bayes this would be no different from what I have already described doing in the original post: classifying the data point according to whichever histogram is taller at that particular value of "x". — sputnik44, Jul 25 '22 at 21:10
That is why I am suggesting to build three different classifiers, one for each of your three characteristics. Then, when you get a single piece of information about a new individual, you can query the corresponding classifier. — Stephan Kolassa, Jul 26 '22 at 05:14
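A minimal sketch of the "one classifier per characteristic" idea from the comments, working directly from histogram counts. Everything here is an assumption made for illustration: the helper name `build_classifier`, the example counts, the equal default priors, and the add-one (Laplace) smoothing are not from the thread.

```python
def build_classifier(counts_by_class, priors=None, alpha=1.0):
    """Build a single-feature Bayes classifier from histogram counts.

    counts_by_class: {class_label: {answer: count}}  (hypothetical layout)
    priors: {class_label: prior probability}; equal priors if omitted
    alpha: add-alpha smoothing so empty bins do not give zero likelihood
    """
    classes = list(counts_by_class)
    # All answers (bins) seen in any class for this question.
    bins = {b for hist in counts_by_class.values() for b in hist}
    totals = {c: sum(counts_by_class[c].values()) for c in classes}
    if priors is None:
        priors = {c: 1.0 / len(classes) for c in classes}

    def posterior(answer):
        scores = {}
        for c in classes:
            n = counts_by_class[c].get(answer, 0)
            likelihood = (n + alpha) / (totals[c] + alpha * len(bins))
            scores[c] = priors[c] * likelihood
        norm = sum(scores.values())
        return {c: s / norm for c, s in scores.items()}

    return posterior


# Made-up counts, one histogram per gender per question.
height_counts = {
    "male": {"5'2": 5, "5'8": 40, "6'0": 55},
    "female": {"5'2": 50, "5'8": 40, "6'0": 10},
}
colour_counts = {
    "male": {"red": 30, "blue": 50, "green": 20},
    "female": {"red": 45, "blue": 35, "green": 20},
}

# One classifier per question; query whichever matches the new answer.
classifiers = {
    "height": build_classifier(height_counts),
    "colour": build_classifier(colour_counts),
}

post = classifiers["height"]("5'2")
print(post)
```

With equal priors this reduces to comparing the normalised histograms at the given answer, as noted in the thread; the added value is a posterior probability rather than a hard label, plus the option to pass non-equal priors if the two groups are not equally common.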
