12

How can I efficiently select a random element from a std::set?

A std::set::iterator is not a random access iterator, so I can't directly index a randomly chosen element as I could for a std::deque or std::vector.

I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.

Is there a better approach?

In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".

Edit...

Many insightful answers below.

The short version is that even though you can find a specific element in O(log n) time, you can't find an arbitrary element in that time through the std::set interface.

Drew Dormann

6 Answers

8

Use boost::container::flat_set instead:

boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();

Insertions and deletions become O(N), though; I don't know if that's a problem for you. You still have O(log N) lookups, and because the container is contiguous, the overall improvement often outweighs the loss of O(log N) insertions and deletions.

Benjamin Lindley
4

What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.

Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.

Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value you must use a predicate & as the predicate so you can reach in and modify it (setting it to "randomize" mode).

Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:

#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>

using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }
    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;

        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {
            size = 0;
            return false;
        } else {
            size /= 2;
            return r & 1;
        }
    }

    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));

    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> & > s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);

    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
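        // The iterator type must name the same comparator type as the set itself.
        set<int, RandomPredicate<int> & >::iterator it = s.lower_bound(0);
        // A random walk can run off the right edge of the tree and yield end().
        if (it != s.end())
            cout << *it << endl;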
    }
}

My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.

Ben Jackson
  • You can't change the sort predicate for a set/map in your find/lower_bound call (unless you use `std::find` which is linear and asked to be avoided in the OP). – Mark B Sep 05 '12 at 19:52
  • @MarkB ah, I was adding the same thing in an edit when you were making your comment. I was hoping someone would know of a similar alternative... – Ben Jackson Sep 05 '12 at 19:54
  • I wonder if you could fool a set into behaving like it were a set with a different predicate. – Drew Dormann Sep 05 '12 at 20:13
  • This is very clever - the only thing I would suggest is to shift away say the lower eight bits of the random number before doing the test (in other words don't use bit 0 as your on/off random check). – Mark B Sep 06 '12 at 15:11
2

If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.

Xeo's solution of placing iterators in a vector takes O(n) time and space to set up, but each selection afterwards is constant time, so the cost amortizes well over many selections. This compares favourably to std::next, which is O(n) time for every selection.
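
The iterator-vector idea might look roughly like the sketch below. The helper names snapshot_iterators and pick_random are made up for illustration, and the snapshot is assumed to be rebuilt whenever the set changes (an erased element leaves a dangling iterator in the vector, and a newly inserted one is simply missing from it).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Hypothetical helper: copy every iterator of the set into a vector.
// O(n) time and space; must be redone after the set is modified.
template <typename T>
std::vector<typename std::set<T>::const_iterator>
snapshot_iterators(const std::set<T>& s) {
    std::vector<typename std::set<T>::const_iterator> its;
    its.reserve(s.size());
    for (auto it = s.begin(); it != s.end(); ++it)
        its.push_back(it);
    return its;
}

// Hypothetical helper: O(1) uniform pick from a non-empty snapshot.
template <typename It, typename Rng>
It pick_random(const std::vector<It>& its, Rng& rng) {
    assert(!its.empty());
    std::uniform_int_distribution<std::size_t> dist(0, its.size() - 1);
    return its[dist(rng)];
}
```

Typical use would be to build the snapshot once, then call pick_random(its, rng) with a std::mt19937 as many times as needed before the next modification of the set.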

ecatmur
1

You can use the std::advance function:

set <int> myset;
//insert some elements into myset
int rnd = rand() % myset.size();
set <int> :: const_iterator it(myset.begin());
advance(it, rnd);
//now 'it' points to your random element

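Another way to do this is O(log n) but not uniform: pick a random value between the smallest and largest elements and take the first element not less than it. Each element is chosen with probability roughly proportional to the gap between it and its predecessor, so elements that follow large gaps are favoured:

int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);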
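
For the uniform std::advance variant above, a small helper built on <random> avoids the slight modulo bias of rand() % size. This is only a sketch: the name random_element is hypothetical, and the cost is still O(n) per call because std::next walks the set one node at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <set>

// Hypothetical helper: uniform pick from a non-empty std::set.
// Linear time, since set iterators are only bidirectional.
template <typename T, typename Rng>
typename std::set<T>::const_iterator
random_element(const std::set<T>& s, Rng& rng) {
    assert(!s.empty());
    std::uniform_int_distribution<std::size_t> dist(0, s.size() - 1);
    return std::next(s.begin(), static_cast<std::ptrdiff_t>(dist(rng)));
}
```

A caller would keep one generator around, for example std::mt19937 rng(std::random_device{}()), and pass it to each call.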
Chris
  • `std::advance` has the same performance characteristics of using the increment operator `rnd` times which is what the OP is trying to avoid. – IronMensan Sep 05 '12 at 19:51
  • @IronMensan True. Unfortunately I don't think one can avoid doing that other than by building your own balanced binary tree and then traversing it randomly. – Chris Sep 05 '12 at 19:53
  • @IronMensan I gave this another shot, check my new answer for reference, if you're interested. – Chris Sep 05 '12 at 20:15
1

If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.

Another approach, as seen in a comment, is to keep a vector of iterators into the set (for a set, only iterators to erased elements are invalidated) and randomly select an iterator.

Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
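
As a rough illustration of the last option, a sorted, duplicate-free std::vector can stand in for the set. The class name SortedVectorSet and its members are hypothetical, and the sketch is limited to int to stay short. Lookups stay O(log n) and random selection becomes O(1), at the price of O(n) insertion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical wrapper: a sorted std::vector without duplicates, used as a set.
class SortedVectorSet {
public:
    // O(log n) search plus O(n) element shifting.
    // Returns false if the value was already present.
    bool insert(int value) {
        auto pos = std::lower_bound(data_.begin(), data_.end(), value);
        if (pos != data_.end() && *pos == value)
            return false;
        data_.insert(pos, value);
        return true;
    }

    // O(log n) membership test.
    bool contains(int value) const {
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    std::size_t size() const { return data_.size(); }

    // O(1) uniform pick; the container must not be empty.
    template <typename Rng>
    int random_element(Rng& rng) const {
        assert(!data_.empty());
        std::uniform_int_distribution<std::size_t> dist(0, data_.size() - 1);
        return data_[dist(rng)];
    }

private:
    std::vector<int> data_;
};
```

Inserting each element in sorted position is one design choice; appending everything and calling std::sort followed by std::unique once, as the answer suggests, is cheaper when insertions come in batches.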

Mark B
1

You can do this by maintaining a normal array of values alongside the set; when you insert into the set, you append the element to the end of the array (O(1)), and when you want a random element you can grab it from the array in O(1) as well.

The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:

Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i in the array. Keep a segment tree, where you keep the maximum prfx[i] contained in each range.

Deleting an element lowers prfx[i] by one for every i at or after the deleted position, which a segment tree with lazy range updates can apply in O(log n) per deletion. Now, when you want to access a random element, you query the segment tree to find the "real" index of the element (by finding the earliest range in which the maximum prfx is equal to the random index). This makes selecting a random element O(log n).
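
A closely related structure, swapped in here because it is shorter to write than the segment tree described above, is a Fenwick (binary indexed) tree over 0/1 "alive" flags. It supports the same two operations in O(log n): mark a slot deleted, and find the k-th slot that is still alive. The class name IndexedPool, its members, and the fixed capacity are assumptions made only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class: append-only array of values plus a Fenwick tree that
// counts which slots are still alive. Capacity is fixed to keep this short.
class IndexedPool {
public:
    explicit IndexedPool(std::size_t capacity)
        : tree_(capacity + 1, 0), alive_count_(0) {
        values_.reserve(capacity);
    }

    // Append a value, mark its slot alive, and return the slot index.
    std::size_t append(int value) {
        assert(values_.size() < tree_.size() - 1);
        values_.push_back(value);
        alive_.push_back(true);
        add(values_.size(), +1);  // Fenwick positions are 1-based
        ++alive_count_;
        return values_.size() - 1;
    }

    // Mark a slot deleted in O(log n); the value itself stays in the array.
    void erase_slot(std::size_t slot) {
        assert(slot < values_.size() && alive_[slot]);
        alive_[slot] = false;
        add(slot + 1, -1);
        --alive_count_;
    }

    std::size_t alive() const { return alive_count_; }

    // Return the k-th (0-based) alive value in O(log n) by binary descent.
    int kth_alive(std::size_t k) const {
        assert(k < alive_count_);
        std::size_t step = 1;
        while (step * 2 < tree_.size()) step *= 2;
        std::size_t pos = 0;
        int remaining = static_cast<int>(k) + 1;
        for (; step > 0; step /= 2) {
            std::size_t next = pos + step;
            if (next < tree_.size() && tree_[next] < remaining) {
                pos = next;
                remaining -= tree_[next];
            }
        }
        return values_[pos];  // pos is the 0-based slot of the k-th alive value
    }

private:
    // Add delta at a 1-based position and propagate to covering nodes.
    void add(std::size_t pos, int delta) {
        for (; pos < tree_.size(); pos += pos & (~pos + 1))
            tree_[pos] += delta;
    }

    std::vector<int> values_;
    std::vector<bool> alive_;
    std::vector<int> tree_;
    std::size_t alive_count_;
};
```

To pick a random element, draw k uniformly from [0, alive()) and call kth_alive(k). The set would need to remember each element's slot index (for instance in a std::map from value to slot) so that erasing from the set can also call erase_slot.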

Chris