A simulator to determine how many items you need to pull from an infinite random set until you have the entire set. For example, this can be used to work out how many times you need to roll a fair six-sided die until you see each number at least once. This problem is known as the Coupon collector's problem (as I only found out after writing all this code).
This code offers two different solutions: the mean and the mode.
The mean can be thought of as a situation where you have thousands of people all rolling dice until they see each face once.
The mean will be the average of the counts for each person.
The mode tells you the most common number of rolls you need to do to finish.
This appears to always be less than the mean.
The Results section has the results for some common cases.
n=2
is for a fair coin, n=4, 6, 8, 10, 12, 20
are for standard dice with 4 through to 20 faces, n=54
and n=52
are for decks of cards with and without jokers, respectively.
Compile with
gcc shuffler.c -o shuffler
There are several compiler flags that can be used to modify the behaviour of the program.
Flag | Effect |
---|---|
MEAN |
Returns the mean value of all the runs |
MODE |
Returns the mode of the run |
MEDIAN |
Returns the median of the run |
FREQ_TABLE |
Returns the count of each value when used in conjunction with MEDIAN or MODE |
DEBUG |
Prints debug output while running |
VERBOSE or V |
Prints verbose debug output when used in conjunction with DEBUG |
TOTAL_PRINT |
Prints the total of each run |
At least one of MEAN
, MEDIAN
, or MODE
must be used otherwise no result will be printed.
All other flags are optional.
gcc shuffler.c -o shuffler -DFLAG1 -DFLAG2
For example, if you want to print the mean, mode, and the frequency table for the mode, you would use:
gcc shuffler.c -o shuffler -DMEAN -DMODE -DFREQ_TABLE
The order of the flags does not matter, but they must all begin with -D
.
Run the program with
./shuffler n i
where n
is the number of different random items to choose from and i
is the number of times to run the test.
You may be curious about the results for some numbers.
Each of these tests was done for one million runs ./shuffler n 1000000
, however, results may still vary between runs.
n |
Mean | Median | Mode |
---|---|---|---|
2 | 2 | 2 | 3 |
3 | 5.5 | 5 | 4 |
4 | 8.333 | 7 | 6 |
5 | 11.417 | 10 | 8 |
6 | 14.700 | 13 | 11 |
8 | 21.743 | 20 | 17 |
10 | 29.290 | 27 | 23 |
12 | 37.239 | 35 | 28 |
20 | 71.955 | 67 | 59 |
52 | 235.978 | 225 | 209 |
54 | 247.073 | 235 | 217 |
As n
increases, the mode (and, to a lesser extent, the mean) begins to vary significantly between runs.
There is a mathematical way to find a solution to this problem, which is great because it's so much faster to solve than taking millions of samples.
Without going into the details of how this expression was found (hint: it uses probabilities and expected values, but is not quite the same as in the Wikipedia article for the Coupon collector's problem, I promise), the equation for the mean is given below:
The beauty of this is that it is simply the Harmonic number () multiplied by
n
.
It can even be approximated for large n
with:
where is the Euler-Mascheroni constant.
I haven't come up with a mathematical expression for this yet, but the median is expressed by OEIS Sequence A073593.
There is also a software implementation for the mean so you can compare the speed for yourself.
The time complexity is O(n)
as opposed to O(ni)
for the non-mathematical implementation, where i
is the number of iterations.
As with the non-mathematical implementation, the code can be compiled the same way (but using the math_shuffler.c
file instead) with the following compiler flags:
Flag | Effect |
---|---|
MEAN | Returns the equivalent of MEAN in the original implementation |
This version only requires one argument (n
) as it only ever needs to run one iteration.