34

Is there a data structure that takes an unordered array of $n$ items, performs preprocessing in $O(n)$ and answers queries: is there some element $x$ on the list, each query in worst time $O(\log n)$?

I really think there isn't, so a proof that none exists is also welcome.

Artem Kaznatcheev
  • 10,251
  • 12
  • 74
  • 174
Chi-Lan
  • 443
  • 3
  • 6
  • So you do consider expected time? – Raphael Aug 18 '11 at 09:42
  • @Raphael, I don't understand. Of course I consider the expected time; that's actually the only limitation I pose. – Chi-Lan Aug 18 '11 at 10:38
  • So, in the first case, is the list sorted when you search it? – Filip Ekberg Aug 18 '11 at 11:10
  • @Filip, no the list is never sorted. Sorted list is a data structure you can use in order to find an element in the list quickly. – Chi-Lan Aug 18 '11 at 12:10
  • 3
    (1) I do not know why you can say “Of course I consider the expected time,” because you do not state “expected” in your question at all. Please try to state your question more precisely before saying “of course.” (2) Please define “non-hashable.” – Tsuyoshi Ito Aug 18 '11 at 12:32
  • @Tsuyoshi @Raphael I apologize, I didn't understand that in English "expected time" means E[time]. I thought Raphael meant "do you care about the running time?" So no, I am speaking about worst-case time. (2) Non-hashable: you are not allowed to calculate the hash of those items, so you cannot use a hash table for these values. The only allowed action is to compare two values in the list. – Chi-Lan Aug 18 '11 at 12:46
  • 2
    (1) I see. Thanks for the explanation. If someone asked “Do you care about the running time?” then the answer would be indeed “of course.” :) (2) I think that “The only allowed action is to compare two values in the list” is much more precise than just stating “non-hashable.” Can you edit the question so that people do not have to read the comments to understand what “non-hashable” means? – Tsuyoshi Ito Aug 18 '11 at 12:54
  • 3
    By the way, if you cannot prove it, why do you know it is impossible? If it is an exercise in a textbook or a class, you are asking on a wrong website. – Tsuyoshi Ito Aug 18 '11 at 12:56
  • I'm not entirely sure, but it looks very "reasonable". I know it's impossible, generally speaking, to sort a list in the comparison-only model with fewer than O(n log n) comparisons, and I know that all known data structures based only on comparisons (heap, red-black tree, etc.) cannot be built in less than O(n log n) time, so I'd be very surprised to learn that it's possible. (I don't think it's a known open problem, because I've never heard of it, but maybe it is, I don't know). – Chi-Lan Aug 18 '11 at 12:59
  • And just to address your "accusations": it is not a problem from a textbook. I thought about it when trying to solve a puzzle I heard. – Chi-Lan Aug 18 '11 at 13:01
  • "all known data structures which are based on comparison only (heap ". Heap can be built in $O(n)$ time. – Chao Xu Aug 18 '11 at 13:03
  • @Chi-Lan, you can't perform an O(log n) search on an unsorted/unbalanced data structure. – Filip Ekberg Aug 18 '11 at 13:07
  • 6
    Is this your question: Is there a data structure that takes an unordered array of n items, performs preprocessing in O(n) and answers queries: is there some element x on the list, each query in worst time O(log n)? – sdcvvc Aug 18 '11 at 13:11
  • 1
    Just in case anyone is trying to prove the opposite, here is one idea: generate $O(n/\log n)$ lists of size $O(\log n)$, locate which list contains the element in $O(\log n)$ time, and do a linear search. One has to come up with a clever way to generate the lists. The naive candidate "partition the list such that all elements in one list are smaller than all elements in the next" can't be done in $O(n)$ time (since one could use it recursively and show sorting is $O(n)$). – Chao Xu Aug 18 '11 at 13:18
  • @sdcvvc exactly, I'm editing it into the question. One caveat: the algorithm may only compare two elements of the list. – Chi-Lan Aug 18 '11 at 13:25
  • 2
    @Filip: Is that easy to see? If it is true, then I agree that it solves the question. – Tsuyoshi Ito Aug 18 '11 at 13:29
  • @Tsuyoshi Ito, Hmm, I don't know the correct way to express a "Proof" but by example, maybe? Let me think about it a bit more.. – Filip Ekberg Aug 18 '11 at 21:41
  • 1
    @Filip: You want an adversary argument. Take the $O(n)$ preprocessing algorithm. It lumps together $n!/\exp(n)$ of the permutations to the same state. Now take the query algorithm, and find a sequence of consistent answers for some query $x$ for which the algorithm answers "no" without knowing for sure. The partial queries reduce the space of permutations (e.g. if $x<a$ and $b<x$ then $b<a$). You only know that $x$ is not there if $a<x<b$ for two adjacent $a,b$ (adjacent on all permutations!) or $x<a_{\min}$ or $x>a_{\max}$. – Yuval Filmus Aug 19 '11 at 15:18
  • 1
    Alternatively, run the same adversary argument, and show that it is impossible that for all $n$ possible "yes" queries, the algorithm will answer "yes" honestly (i.e. find the actual element). – Yuval Filmus Aug 20 '11 at 15:59
  • 2
    Is there any reason that no one is actually answering :), and just using the comments ? – Suresh Venkat Aug 20 '11 at 16:13
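As an aside for readers following Chao Xu's partition idea in the comments above: the query side of that scheme is easy to sketch. The code below is hypothetical (the function name and list layout are mine); the missing piece, as the comment notes, is building such lists with only $O(n)$ comparisons.

```python
import bisect

def query(lists, x):
    """Membership query against O(n / log n) lists of size O(log n),
    where every element of lists[i] is smaller than every element of
    lists[i + 1] (each list itself is unsorted)."""
    maxima = [max(lst) for lst in lists]     # precomputed in a real structure
    i = bisect.bisect_left(maxima, x)        # O(log(n / log n)) comparisons
    return i < len(lists) and x in lists[i]  # O(log n) linear scan
```

With the lists in hand, each query really does cost $O(\log n)$ comparisons; the whole difficulty is the preprocessing.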

3 Answers

30

Here's a proof that it's impossible. Suppose you could build such a data structure. Build it. Then choose $n/\log n$ items at random from the list, add $\epsilon$ to each of them, where $\epsilon$ is smaller than the difference between any two items on the list, and perform the queries to check whether any of the resulting items is in the list. These $n/\log n$ queries cost $O(\log n)$ each, so together with the preprocessing you have spent only $O(n)$ time so far.

I would like to claim that the comparisons you have done are sufficient to tell whether an item $a$ on the original list is smaller than or larger than any new item $b$. Suppose you couldn't tell. Then, because this is a comparison-based model, you wouldn't know whether $a$ was equal to $b$ or not, a contradiction of the assumption that your data structure works.

Now, since the $n/\log n$ items you chose were random, your comparisons have with high probability given enough information to divide the original list into $n/\log n$ lists each of size $O(\log n)$. By sorting each of these lists, you get a randomized $O(n \log \log n)$-time sorting algorithm based solely on comparisons, a contradiction.

Peter Shor
  • 24,763
  • 4
  • 93
  • 133
  • A few hints to help understanding the proof (assuming I understand it correctly myself of course): the $b$ items should be filled in by the items after $\epsilon$ has been added to them; the comparison model guarantees you know which of the cases $a \leq b$ and $a \geq b$ holds; the $n / \log n$ lists are in 'ascending order': every element in any higher list is higher than every element in any lower list; after the original queries you have enough information to make the lists around the items you randomly chose, – Alex ten Brink Aug 21 '11 at 13:17
  • (continued) note that you don't even have to explicitly be able to build the list in the provided time for the proof to hold. – Alex ten Brink Aug 21 '11 at 13:17
  • I don't quite understand this proof. The final contradiction rests on the algorithm being "based solely on comparisons", but in the first steps of our algorithm we added $\epsilon$ to each item (further, "where $\epsilon$ is smaller than the difference between any two items on the list"). Why are we still justified that our algorithm is comparison-based, if we assumed our items have a non-discrete total order on them? – Artem Kaznatcheev Aug 21 '11 at 17:40
  • 5
    @Artem: Your original input consists of elements $x \in X$. Then you construct a new set $X' = X \times {0,1}$; you represent an original $x \in X$ as $(x,0) \in X'$ and a modified $x + \epsilon$ as $(x,1) \in X'$. Now you use the black box algorithm; the algorithm compares elements of $X'$ to each other; to answer such queries, you only need to compare a constant number of elements of $X$ to each other. Hence everything should be doable in the comparison model, with a constant overhead. – Jukka Suomela Aug 21 '11 at 18:48
  • @Alex "note that you don't even have to explicitly be able to build the list in the provided time for the proof to hold." why? I don't see how to sort those lists without constructing them... – Chao Xu Aug 22 '11 at 13:24
  • nvm, I understand: constructing the lists doesn't need any more comparisons, and for sorting only comparisons are counted. – Chao Xu Aug 22 '11 at 13:48
  • I am pretty sure I am missing something. Doesn't this also disprove existence of an $O(\log^2n)$ algorithm (which exists)? (Replacing $n/\log n$ by $n/\log^2 n$ throughout). Thanks. – Aryabhata Aug 23 '11 at 19:04
  • 1
    @Aryabhata: it does. What is the $O(\log^2 n)$ algorithm? – Peter Shor Aug 23 '11 at 19:08
  • @Peter: Hmm... I thought there was one, but I believe that is where I must be mistaken. Thank you. – Aryabhata Aug 23 '11 at 19:12
24

Here, I believe, is a different proof, showing the impossibility of an $\mathcal{O}(\log ^k n)$ query-time structure with $\mathcal{O}(n)$ preprocessing.

Suppose in the preprocessing you do $\mathcal{O}(n)$ comparisons, leading to a partial order.

Now consider the size $A$ of the largest antichain in that partial order. Since its elements are pairwise incomparable, for us to have an $\mathcal{O}(\log ^k n)$ query algorithm we must have $A = \mathcal{O}(\log ^k n)$.

Now by Dilworth's theorem, there is a partition of size $A$, into chains.

Now we can extend the algorithm to determine the chains in the partition. We can determine whether two elements are comparable by building a directed graph of the comparisons made and doing a reachability analysis; this requires no additional comparisons. Then just brute-force over each possible partition of size $A$ to check whether it is a partition into chains.

Once we have the chains, we can merge them, sorting the whole list with only $\mathcal{O}(n \log \log n)$ additional comparisons, contradicting the $\Omega(n \log n)$ lower bound for comparison sorting.
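The merge step can be sketched as a tournament over the chain heads: with $A$ chains, each of the $n$ output steps costs $O(\log A)$ comparisons, so $A = \mathcal{O}(\log^k n)$ gives $\mathcal{O}(n \log \log n)$ in total. A minimal counting sketch (the wrapper class is mine, using Python's `heapq.merge` as the tournament):

```python
import heapq

def merge_chains_counting(chains):
    """Merge already-sorted chains and count the value comparisons spent;
    heapq.merge keeps a small heap holding one head per chain, so each
    output step costs O(log A) comparisons."""
    count = 0

    class Counted:
        __slots__ = ("v",)
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            nonlocal count
            count += 1
            return self.v < other.v

    wrapped = ([Counted(v) for v in chain] for chain in chains)
    merged = [c.v for c in heapq.merge(*wrapped)]
    return merged, count
```

For a fixed number of chains the comparison count grows linearly in the merged length, not as $n \log n$.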

Aryabhata
  • 1,855
  • 3
  • 21
  • 29
  • 1
    This is a nice idea. And if you could show that the chain partition must be known to the algorithm then you could use mergesort to show that it would only take an additional O(n log log n) comparisons to sort the whole input, rather than using Jensen. But there is a problem: why must the preprocessing algorithm construct a chain partition? A chain partition must exist, yes, but that's very different from it being known to the algorithm. – David Eppstein Aug 23 '11 at 20:49
  • @David: We should be able to complement the algorithm to construct the partitions without doing any additional comparisons, for instance by considering every possible partition and checking if it is a chain partition. All we need to know is whether two elements are comparable, which I suppose we can do by some reachability analysis on the directed graph of compares. We don't need to do any additional compares. And given a 'bag' of elements which form a chain, we should be able to arrange them in ascending order (i.e. sort them) without doing any more compares. – Aryabhata Aug 23 '11 at 20:55
  • The last paragraph seems to be lacking. It could happen that on a specific input, $n-1$ comparisons are enough to determine the entire total order. What you need to show is that sometimes you'll get an underdetermined partial order. – Yuval Filmus Aug 23 '11 at 21:46
  • @Yuval: "In the worst case" is implicit (but could have potential holes). David's suggestion makes it more clear. I will edit. Thanks. – Aryabhata Aug 23 '11 at 22:08
  • 8
    Ok, I now believe this proof. And it shows more strongly that to achieve polylog query time you have to use a number of comparisons that's within an additive $O(n\log\log n)$ of sorting. Nice. By the way, the chain partition can be found in polynomial time from the set of comparisons already performed, rather than needing a brute force search, but that doesn't make any difference for your argument. – David Eppstein Aug 23 '11 at 23:08
  • 6
    The proofs actually show that you must have either $\Omega(n\log n)$ preprocessing or $\Omega(n)$ for each query. Of course both are tight. This shows that binary search and linear search are the only "interesting" search algorithms (at least in the classical world). – Yuval Filmus Aug 26 '11 at 15:12
  • 1
    @Yuval: maybe you should write up this observation as an actual answer, because it seems to me that you have to do a moderate amount of work to get the above result from the proofs in the answers. – Peter Shor Aug 31 '11 at 01:05
  • 1
    @Yuval: thinking about the proofs, I only see that you must have either $\Omega(n \log n)$ preprocessing or $\Omega(n^{1-\epsilon})$ query time for all $\epsilon$. It's possible to have $o(n \log n)$ preprocessing time and $O(n / \log n)$ query time. One can divide the list into $\log n$ lists of size $n/\log n$ each in time $\theta(n \log\log n)$ using repeated median-finding. – Peter Shor Sep 19 '11 at 21:26
0

As Peter Shor's answer notes, to rule out membership in a comparison-based model, we need to know how the element compares with every member. Thus, using $k<n$ random queries (the number of members smaller than the queried nonmember is random), we gain $Θ(n \log k)$ information relative to having $n$ unsorted values. Therefore, for some constant $c>0$, using $c \, n \log k$ preprocessing, we cannot have $≤c \, n \log k/k$ query cost. This is optimal up to a constant factor since we can sort the data into $k' = k / \log k ≤ n/\log n$ approximately equal buckets (each bucket unsorted) in $O(n \log k') = O(n \log k)$ time, which allows $O(n/k')$ query cost.

In particular, using $O(n)$ preprocessing, we cannot have $o(n)$ query cost. Also, $o(n \log n)$ preprocessing corresponds to $k$ in $O(n^ε)$ for every $ε>0$ and thus $Ω(n^{1−ε})$ query cost.

Dmytro Taranovsky
  • 2,577
  • 1
  • 10
  • 24