
I'm looking for an algorithm which can optimize the selection of queries used to build a decision tree. However, unlike most decision trees, I am constrained to ask the same set of questions for every case.

I'll couch this in the form of the parlor game Twenty Questions. One person thinks of something (e.g. an elephant) and the other needs to guess the object based on a set of binary questions. For example, "is it bigger than a breadbox?" and "could I put it in my mouth?".

In principle, a perfect decision tree could distinguish among up to 2**20 ≈ 1 million different answers. This assumes that there is no correlation between one question and the next. (There are very few things bigger than a breadbox that I can put into my mouth. Cotton candy is likely one of them.)

As a point of strategy, quoting from the Wikipedia article on the game:

Careful selection of questions can greatly improve the odds of the questioner winning the game. For example, a question such as "Does it involve technology for communications, entertainment or work?" can allow the questioner to cover a broad range of areas using a single question that can be answered with a simple "yes" or "no". If the answerer responds with "yes," the questioner can use the next question to narrow down the answer; if the answerer responds with "no," the questioner has successfully eliminated a number of possibilities for the answer.

The goal of this sort of broad question is to move the selection/rejection point toward 50% of the universe under consideration. This gives the questioner the most information per answer.

Question 1: Given a set of several million single-topic questions, each with low discrimination, how does one construct a more informative multi-topic question which is the union of several of those low-level questions?
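To make the 50% target concrete, here is the kind of greedy heuristic I have in mind. This is a minimal sketch in Python, assuming each feature is stored as the set of record ids that contain it; the function names are mine, not an established algorithm:

```python
import math

def binary_entropy(p):
    """Bits of information from a yes/no question answered 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def greedy_union_question(features, n_records):
    """Greedily OR together features while the union's entropy keeps increasing.

    `features` maps a feature name to the set of record ids containing it.
    Entropy peaks when the union covers 50% of the n_records universe.
    """
    covered, chosen, best = set(), [], 0.0
    # Consider the widest features first so coverage approaches 50% quickly.
    for name, ids in sorted(features.items(), key=lambda kv: -len(kv[1])):
        gain = binary_entropy(len(covered | ids) / n_records)
        if gain > best:
            covered |= ids
            chosen.append(name)
            best = gain
    return chosen, covered
```

Binary entropy peaks at 50% coverage, so the loop stops absorbing features once the union would overshoot. Picking the best union is related to maximum coverage, which is NP-hard, so a single greedy pass like this is only a heuristic.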

Also, as I mentioned, I have an extra constraint: I always have to ask the same fixed set of questions, so I can't use feedback from a previous question to drive the next one.

In my case, I'm building an inverted index as part of a chemical substructure search. There are some 3 million unique features in my target set of about 50 million records. These features include "contains a ring of size 5" and "contains at least 4 carbon atoms." About half of these features exist in only a single record. When a substructure query comes in, I can identify its features, fetch the corresponding inverted index entries, and take their intersection. This can speed up the search a lot.
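For reference, the lookup side is straightforward. A minimal sketch, again assuming each inverted index entry is a set of record ids (all names here are hypothetical):

```python
def candidate_records(query_features, inverted_index, all_ids):
    """Prune a substructure search by intersecting posting lists.

    `inverted_index` maps a feature to the set of record ids containing it.
    A record lacking any indexed query feature cannot match, so only the
    intersection needs the expensive structure-by-structure verification.
    """
    # Features that were not kept in the index simply don't filter anything.
    postings = [inverted_index[f] for f in query_features if f in inverted_index]
    if not postings:
        return set(all_ids)  # no indexed feature in the query: full scan
    postings.sort(key=len)   # intersect smallest-first to keep sets small
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
        if not result:
            break            # no record contains every required feature
    return result
```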

For legacy reasons, I am limited to storing 1024 indices. In principle that should be enough to distinguish between 2**1024 different molecules, but unfortunately most features aren't that discriminating, either because they are rare or because they are too well correlated with other features.

Question 2: How can I take this into account when selecting my list of pre-determined queries?
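One direction I have considered: pick the 1024 questions greedily, scoring each candidate by how well it splits the partition induced by the questions already chosen. A feature that is redundant with (perfectly correlated to) earlier picks splits nothing and scores zero, so correlation is handled implicitly. A rough sketch, using the same set-of-record-ids representation as above; the scoring rule is my own, not a published method:

```python
def select_fixed_questions(features, record_ids, k=1024):
    """Greedily pick k features, each scored by how well it splits the current partition.

    All records start in one bucket; every chosen feature splits each bucket
    into (has, lacks) halves. min(|has|, |lacks|) rewards balanced splits, and
    a feature redundant with those already chosen splits nothing: score 0.
    """
    buckets = [set(record_ids)]
    candidates = dict(features)  # feature name -> set of record ids
    chosen = []
    for _ in range(k):
        best_name, best_score = None, 0
        for name, ids in candidates.items():
            score = 0
            for b in buckets:
                inside = len(b & ids)
                score += min(inside, len(b) - inside)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break  # no remaining feature refines the partition
        ids = candidates.pop(best_name)
        chosen.append(best_name)
        buckets = [part for b in buckets
                   for part in (b & ids, b - ids) if part]
    return chosen
```

At my scale (3 million features, 50 million records) this exact loop is far too slow, so it would have to run on a random sample of records with aggressive candidate pruning.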

Algorithms like C4.5 for building decision trees are well known. However, C4.5 picks a single feature for each decision. I would like to enrich discrimination by merging multiple questions into one. ("Does it have at least 16 carbon atoms or at least 5 oxygen atoms?") The method must also have a cost model, because some of the more complicated features take time to evaluate, and that time might not be worthwhile.
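The cost model I have in mind would score a candidate question, merged or not, in bits of information per unit of evaluation time. A sketch, with an illustrative scoring rule:

```python
import math

def info_per_cost(feature_sets, n_records, eval_cost):
    """Score a question (the OR of one or more feature sets) in bits per unit cost.

    `feature_sets` is a list of record-id sets merged into one question;
    `eval_cost` is the estimated time to evaluate that question on one record.
    """
    p = len(set().union(*feature_sets)) / n_records
    if p in (0.0, 1.0):
        return 0.0
    gain = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return gain / eval_cost

# Merge questions a and b only when the merged question wins on this score:
#   info_per_cost([a, b], n, cost_ab) > max(info_per_cost([a], n, cost_a),
#                                           info_per_cost([b], n, cost_b))
```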

I don't know where to go from here, and I haven't been able to come up with a good heuristic myself. My hope is that, since Twenty Questions is so well known, there is existing work on this topic that people can point me to.

  • I think you should start by experimenting with some known decision tree model, like the one you mention. But you should not only throw the features you already have at it; you should use your insight (and maybe some analysis of correlation between features) to specifically design your own variables, like the one you mention. Then see how well that works. Or you could prepare a competition to post at a site like www.kaggle.com. See also https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 – kjetil b halvorsen Jul 09 '13 at 18:50
  • What you've suggested is the usual approach in this field. (The MACCS keys and PubChem/CACTVS features do that.) What I can't figure out is how to enrich discrimination by merging multiple features into one. My attempts so far haven't been fruitful, but since I doubt I'm the first one to work on this, I decided to look around for suggestions. Regarding Kaggle, I don't yet have enough data to make a good competition benchmark. – Andrew Dalke Jul 10 '13 at 15:29
