What the heck is random sampling?
Random sampling is useful to gain an understanding of a population without examining every file. By randomly selecting our sample, bias is also eliminated because every “file” has an equal chance of being selected. One word of caution though, if you are trying to look for outliers you will need a large sample size.
That begs the question: How big of a sample do I need to take? The short answer: 30 is usually good. If it is a simple test and not the critical element of your finding, 30 should cover you almost every time.
The longer answer is it depends. You need to consider your objectives, how confident you want to be about your results, how much margin of error is tolerable, and how big and varied the population is. More confidence requires larger samples. Less margin of error also increases sample size. Populations that are less uniform (have higher standard deviations) require larger samples too. And if this is a critical element of your finding, you need even more.
This is a handy online calculator for calculating the sample size needed to estimate the average of a population. Older textbooks like this one are great office resources (this one is super easy to follow). Better yet, it sells for about $10 making it a steal.
Excel: The easy way to pull a random sample
If the population you are reviewing is not numbered, you will need to create an index number for each file.
Excel has a built in random number generator. By using “RandBetween()” we can generate our sample. Enter “1” or the lowest possible index number for the bottom, and the largest possible index number for the top. In this scenario, I will use 1 and 100.
Once you have your function looking like this you copy the formula into other cells. I am pulling a sample of 12, so I will drag the formula down 12 cells.
You will note right away that each time you change something on the sheet, the numbers change. So if you want to lock in a sample, you need to copy the cells with the “randbetween” function and paste them as “values”.
I prefer to paste over the cells I just copied.
Here’s the sample I got:
If you come across a duplicate number, you will need to add another row or replace the duplicate with a new “randbetween” function.
Pitfall: Weak audit trail
One of the drawbacks of Excel is that the audit trail is weak. The documentation you have are numbers in a spreadsheet that you could have easily entered manually. If you are working on a piece of evidence that is critical, you will probably want more documentation on how you arrived at your sample.
Our next post will cover how to document a random sample using technology such as ACL and how to document it the low-tech way using a “random number table”.