Hello again, fellow data wonks and wonk wannabes!
Last time, we discussed random sampling in Excel and what factors you should consider when determining your sample size. (Hint: 30 is generally large enough, but not in all cases)
One of the downfalls of Excel is the lack of an audit trail. In these examples, we will provide a high-tech and low-tech way to document your sample selection process in detail. First up, ACL.
The High-Tech Method
I am working with fictional data below. As you can see, our population contains 36 counties. Make note of your population size when working in ACL as this will be important later on. You can count a table by using the shortcut “CRTL + 3”.
Next you select the “Sampling” menu and click on “Sample Records”. This also has a shortcut, which is “CTRL + 9”.
Change “Sample Type” from “MUS” to “Record”. Then click on “random” on the middle left of the interface. Enter in the “size” of the sample. I pulled a sample of 10. The “Seed” allows you to document and repeat a random sample. Any number will do – just pick the first one that comes to mind.
I know what you’re thinking. However, just because something is repeatable does not change the fact that it is random.
Enter in the “population” we recorded earlier, then define the table name you want the sample sent to.
There you have it; a random sample of 10 counties in Oregon, with a full log file and repeatable methodology in case you ever get questioned about how you pulled your sample.
The Low-Tech Method
If you are still hung up on what a seed has to do with random sampling, the low tech way will make it clear to you. Below we have a copy of a random number table. You can find these in the appendix of most statistics textbooks or via Google.
The “seed” tells you where to start on the table. If I have a seed of 1, we would start at the 1st number, which also happens to be a 1. A seed of “3” start at the 3rd number in which in this case is 4. This is what makes it repeatable. Our population was 36, so to pull a sample we will be looking at sequences of 2-digit numbers. I will use a seed of “3” and pull just three samples.
In the random number table to the right, I’ve crossed out the first two numbers since our seed was “3”. Starting with the 3rd number, I looked at each 2 digit sequence. If the number fell between 01 and 36, it was a valid random sample and highlighted in green. If the number was above 36, I moved to the next sequence. Also, if repeats are not allowed in your sample you would move to the next number as well (e.g. 11 would be my next sample, but it was already pulled so I would skip over the repeat). Keep moving right and down until you have pulled the full sample.
In this case, my sample was 01, 11, and 20 or Baker, Gilliam, and Lane (shown below). Functionally, this manual low-tech process is identical to what ACL does.
You can apply the Random Number table approach to extremely large files. If you had 1,000,000 records you would look at 7-digit sequences rather than 2-digit shown above.
And there we have it! Two useful methods for documenting sample selection.
If you are stuck on a project in ACL, Excel, or ArcGIS please submit your topic suggestions for a future blog post.