I'm looking for an algorithm which can optimize the selection of queries used to build a decision tree. However, unlike most decision trees, I am constrained to ask the same set of questions for every case.
I'll couch this in the form of the parlor game Twenty Questions. One person thinks of something (e.g. an elephant) and the other must guess the object by asking binary questions, for example, "Is it bigger than a breadbox?" or "Could I put it in my mouth?"
In principle, a perfect decision tree could distinguish among up to 2**20 ~ 1 million different answers. That assumes the answers to the questions are uncorrelated. (In practice they are not: there are very few things bigger than a breadbox that I could put into my mouth. Cotton candy is likely one of them.)
As a point of strategy, quoting from the Wikipedia article linked above:
Careful selection of questions can greatly improve the odds of the questioner winning the game. For example, a question such as "Does it involve technology for communications, entertainment or work?" can allow the questioner to cover a broad range of areas using a single question that can be answered with a simple "yes" or "no". If the answerer responds with "yes," the questioner can use the next question to narrow down the answer; if the answerer responds with "no," the questioner has successfully eliminated a number of possibilities for the answer.
The goal of this sort of broad question is to move the split point as close as possible to 50% of the universe under consideration; an even split gives the questioner the most information per question. Question 1: Given a set of several million single-topic questions, each with low discriminating power, how does one construct a more informative multi-topic question that is the union of several of those low-level questions?
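To make Question 1 concrete, here is a minimal sketch of the kind of heuristic I have been toying with (the greedy strategy, the function name, and the toy data are my own assumptions, not an established algorithm): OR features together one at a time, always adding the feature whose union moves the combined coverage closest to half the record set.

```python
def build_composite_question(feature_sets, universe_size, target=0.5):
    """Greedily OR together features until their union covers
    roughly `target` of all records.

    feature_sets: dict mapping feature name -> set of record ids.
    Returns the chosen feature names and the union of their record sets.
    """
    chosen, covered = [], set()
    goal = target * universe_size
    remaining = dict(feature_sets)
    while remaining and len(covered) < goal:
        # Pick the feature whose union with the current coverage
        # lands closest to the 50% target.
        best = min(remaining,
                   key=lambda f: abs(goal - len(covered | remaining[f])))
        new_covered = covered | remaining[best]
        if abs(goal - len(new_covered)) >= abs(goal - len(covered)):
            break  # no remaining feature improves the split
        covered = new_covered
        chosen.append(best)
        del remaining[best]
    return chosen, covered

# Toy usage: 10 records, three features of varying coverage.
features = {"ring5": {1, 2, 3}, "4carbons": {3, 4, 5, 6}, "rare": {7}}
questions, coverage = build_composite_question(features, universe_size=10)
# -> (['4carbons', 'rare'], {3, 4, 5, 6, 7}): 5 of 10 records, an even split
```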
Also, as I mentioned, I have an extra constraint: I always have to ask the same set of questions. I can't use the answer to one question to choose the next.
In my case, I'm building an inverted index as part of a chemical substructure search. There are some 3 million unique features in my target set of about 50 million records. These features include "contains a ring of size 5" and "contains at least 4 carbon atoms." About half of the features occur in only a single record. When a substructure query comes in, I can identify its features, fetch the corresponding inverted indices, and intersect them. This screening step can speed up the search a lot.
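For reference, the lookup path looks roughly like this (a sketch, with the index represented as plain Python sets standing in for the stored posting lists):

```python
def candidate_records(query_features, inverted_index):
    """Intersect the posting lists of every indexed feature in the query.

    inverted_index: dict mapping feature -> set of record ids.
    Skipping features that aren't indexed is safe: the screen only
    removes records that provably cannot match the substructure.
    """
    postings = sorted((inverted_index[f] for f in query_features
                       if f in inverted_index), key=len)
    if not postings:
        return None  # nothing indexed: fall back to a full scan
    result = set(postings[0])  # start from the rarest feature
    for p in postings[1:]:
        result &= p
        if not result:
            break  # no record has all the features: no match possible
    return result
```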
For legacy reasons, I am limited to storing 1024 indices. In principle 1024 bits are enough to distinguish 2**1024 different molecules, but in practice most features aren't that discriminating, either because they are rare or because they are strongly correlated with other features. Question 2: How can I take this into account when selecting my list of predetermined queries?
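The closest I can come to stating the objective is the following brute-force sketch (entirely my own framing, and far too slow at 3 million features by 50 million records, but it makes the goal precise): treat each stored feature as one bit of a signature, and greedily pick the bit that best splits the records within the partition induced by the bits chosen so far. Rare features then score low because they barely split anything, and correlated features score low because they re-split blocks that earlier bits already split.

```python
import math

def pick_signature_bits(features, records, budget=1024):
    """Greedily select up to `budget` features as signature bits.

    features: dict mapping feature -> set of record ids containing it.
    records: iterable of all record ids.
    Each step picks the feature with the largest entropy gain over the
    current partition of the records.
    """
    partition = {(): set(records)}  # key: tuple of bit values so far
    chosen = []
    candidates = dict(features)

    def gain(fset):
        # Total information (record-weighted bits) gained by splitting
        # every block of the current partition on this feature.
        g = 0.0
        for block in partition.values():
            n = len(block)
            k = len(block & fset)
            for part in (k, n - k):
                if 0 < part < n:
                    g -= part * math.log2(part / n)
        return g

    for _ in range(budget):
        if not candidates:
            break
        best_gain, best = max((gain(s), f) for f, s in candidates.items())
        if best_gain == 0.0:
            break  # no feature splits any remaining block
        fset = candidates.pop(best)
        chosen.append(best)
        new_partition = {}
        for key, block in partition.items():
            ones, zeros = block & fset, block - fset
            if ones:
                new_partition[key + (1,)] = ones
            if zeros:
                new_partition[key + (0,)] = zeros
        partition = new_partition
    return chosen
```

Even if this captures the objective, I can't run anything like it at my scale, and it says nothing about merging features into composite questions.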
Algorithms like C4.5 for building decision trees are well known. However, C4.5 picks a single feature for each decision, whereas I would like to sharpen the discrimination by merging multiple questions into one ("Does it have at least 16 carbon atoms or at least 5 oxygen atoms?"). The selection also needs a cost model, because some of the more complicated features take time to evaluate, and that time might not be worthwhile.
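For the cost model, the simplest scoring rule I can think of (again my own assumption, not something I have seen in the literature) is information gain per unit of evaluation cost, so an expensive composite question has to earn its keep:

```python
def best_question(candidates):
    """candidates: list of (question, info_gain_bits, eval_cost) tuples,
    where info_gain_bits comes from an entropy calculation like the one
    above and eval_cost from profiling the feature test.

    A cheap, mildly informative question can beat an expensive, sharp one.
    """
    return max(candidates, key=lambda c: c[1] / c[2])
```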
I don't know where to go from here, and I haven't been able to come up with a good heuristic on my own. My hope is that Twenty Questions is well enough known that there is existing work on this topic people can point me to.