with Ron Rivest and Philip Stark
A simple random sample (SRS) of size k from a population of size n is a sample drawn at random in such a way that every subset of k of the n items is equally likely to be selected. The theory of inference from SRSs is fundamental in statistics; many statistical techniques and formulae assume that the data are an SRS. True SRSs are rare; in practice, people tend to draw samples by using pseudo-random number generators (PRNGs) and algorithms that map a set of pseudo-random numbers into a subset of the population. Most statisticians take for granted that the software they use ‘‘does the right thing,’’ producing samples that can be treated as if they are SRSs. In fact, the PRNG algorithm and the algorithm for drawing samples using the PRNG matter enormously. Using basic counting principles, we show that some widely used methods cannot generate all SRSs of size k. In simulations, we demonstrate that the subsets that they do generate do not have equal frequencies, which introduces bias and makes uncertainty calculations meaningless. We compare the ‘‘randomness’’ and computational efficiency of commonly-used PRNGs to a PRNG based on the SHA-256 hash function, which avoids these pitfalls because its state space is countably infinite. We propose several best practices for researchers using PRNGs, including the wide adoption of hash function based PRNGs.