How to: Use Benford’s law in ACL

Last time, we spoke about a fraud detection technique known as Benford’s Law. In summary, it can easily detect outliers in certain sets of data by looking at the first digit of a number. Those outliers can then be examined to see if they are fraudulent. For many financial transactions, approximately 30% of the numbers should start with “1” and approximately 5% should start with “9”. In essence, Benford’s Law compares what your data exhibits against what is expected. For more details, see the post linked above.
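If you want to verify those percentages yourself, the expected share of each leading digit d is log10(1 + 1/d). A minimal Python sketch:

```python
import math

# Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)
for d in range(1, 10):
    print(f"First digit {d}: {math.log10(1 + 1/d):.1%}")

# First digit 1: 30.1%, tapering down to 4.6% for first digit 9.
```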

In the following, I’ll show how to do the first-digit test and then the first-two-digit test. The two-digit test is useful for drilling down on the data to reveal patterns that might not show up in the first-digit test. For example, the two-digit test is very effective at picking up multiples of ten (10, 20, 30, etc.).

Steps

  • Load data in ACL
  • Analyze Tab ⇒ Benfords
  • Select field to analyze
  • Select number of digits
  • Graph

The following contains real data that was used to help identify and prosecute fraudsters in Oregon.

Screenshot of Step 2

[Image: benford1]

Screenshot of Steps 3 & 4

[Image: benford2]

Screenshot of Step 5

Note the big difference between observed and expected values of the first digit “1”.

[Image: benford3]

Conducting the first-two-digit test

After looking at the first digit, consider also looking at the first two digits, even if nothing showed up the first time. Just run Benford again and change the number of leading digits to “2”.
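The mechanics mirror the one-digit test: extract two leading digits instead of one, and compare each share against log10(1 + 1/d) for d from 10 to 99. A rough Python sketch of the extraction step (assuming positive amounts with at least two significant digits):

```python
import math

def first_two_digits(amount):
    # First two significant digits: 4,250.00 -> 42; 20.00 -> 20; 1.5 -> 15
    digits = "".join(c for c in str(abs(amount)) if c.isdigit()).lstrip("0")
    return int(digits[:2])

# Expected share of each two-digit combination d (10..99):
expected = {d: math.log10(1 + 1/d) for d in range(10, 100)}
print(first_two_digits(4250.00), f"{expected[42]:.2%}")   # 42 1.02%
```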

[Image: benford4]

Look at all the spikes at multiples of $10 in the data presented below. In this instance, many of those transactions were fraudulent.

[Image: benford5]

And if you look further, the first-three-digit test reveals a very suspicious pattern: dozens of $100 transactions.

[Image: benford6]

Read more about the outcome of the fraud investigation here, and check back next month for more on the how-tos of data analysis in the world of auditing!


How to: Apply Benford’s Law in Excel to Detect Fraudulent Activity

We apply Benford’s Law here at the Oregon Audits Division as part of our fraud investigations.

For those who haven’t heard of it yet, Benford’s Law is a natural phenomenon that occurs in certain data sets. Just as the Bell Curve predicts a certain distribution of numbers, so does Benford’s. You can use Benford’s to detect fraudulent transactions by looking for outliers.

Benford’s Law predicts that the number 1 will occur more often as the first digit than any other number. In fact, the number 1 is about 6 times more likely to occur than the number 9 (30.1% vs. 4.6%). The law can also be applied to the first two digits, among other variations, but we won’t get into that now.

So what data sets conform to Benford’s? Well, there are some, like the drainage areas of rivers, that do not apply to auditing, but there are also plenty of financial transactions that do. First off, you want a dataset with a large sample size: ideally, over 1,000 records. This is one of the cases where 30 is a very inappropriate sample size.

Second, you want data that is not limited. ATM transactions, for example, are limited because there are minimum and maximum withdrawals; they also generally require increments of $20. Limited data also includes assigned values such as invoice numbers. All of the digits (1 through 9) should be possible as a first digit.

The data should also ideally cross multiple orders of magnitude (e.g. 1 to 10, 10 to 100, 100 to 1,000).

Here’s a list of data that should generally conform:

  • Home addresses
  • Bank account balances
  • Census data
  • Accounting-related data, such as Accounts Receivable
  • Transaction level data

Now that I know what data to use, how can I analyze it? With Excel of course!

Steps:

1 – Load Data in Excel

2 – Calculate first digit

3 – Run Benford’s using COUNTIF

4 – Graph

The following uses real-world data that helped to convict several fraudsters in Oregon.

Screenshot of Steps 1 & 2

[Image: benfords 1]

Using the LEFT function, you can calculate the first digit of a number; for example, =LEFT(A2,1) returns the first character of the value in cell A2.

Screenshot of Step 3

[Image: benfords 2]

Using the COUNTIF function, you can tally how many times each first digit appears in your data. You will need to convert the counts to percentages too. The LOG formula on the right is Benford’s Law in numerical form: the expected share of first digit d is LOG10(1 + 1/d).
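For readers who prefer a script to a spreadsheet, here is a rough Python equivalent of the COUNTIF-and-LOG columns (the amounts are stand-in data, not the case file):

```python
import math
from collections import Counter

amounts = [1200.00, 100.00, 43.10, 987.65, 150.00]    # stand-in data
observed = Counter(str(abs(a)).lstrip("0.")[0] for a in amounts)

for d in range(1, 10):
    share = observed[str(d)] / len(amounts)           # COUNTIF / COUNT
    expected = math.log10(1 + 1/d)                    # Benford's Law
    print(f"{d}: observed {share:.1%}, expected {expected:.1%}")
```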

Screenshot of Step 4

[Image: benfords 4]

Looking at the graph, you can see that the digit 1 is overrepresented. The next step is to drill down on records that do not match Benford’s. A closer examination of the records with a first digit of 1 will yield a large number of $100 transactions. Those $100 transactions were largely, if not all, fraudulent. By using Benford’s you can quickly identify suspicious patterns to help detect fraud.

Benford’s will lead to false positives, so do not assume that an outlier has to be fraud. Next time: how to do Benford’s in ACL, and why you should use the 2-digit Benford’s test.


How To: Analyze payroll data using Excel’s SUMIFS function

A few years ago, I worked on an audit at the Oregon Department of Corrections (DOC). Elected officials were concerned that the DOC was spending too much money on overtime. We used a combination of ACL and Excel to conduct this audit work. We followed the National Institute of Corrections’ Net Annual Work Hours Post Relief Factor methodology.

First, we accessed our state’s central payroll database and pulled the tables we needed, which recorded hours worked by pay code by month. The way the tables were organized was not right for Excel, so we first needed to prepare the information using the Summarize function in ACL. For this example, you will need to summarize on employee classification, pay code, and location, and aggregate both hours and dollars.
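If you ever need to do the same preparation outside ACL, pandas’ groupby is the rough equivalent of Summarize. A quick sketch with stand-in data and illustrative column names:

```python
import pandas as pd

# Stand-in for the raw central payroll extract (column names illustrative).
raw = pd.DataFrame({
    "location":       ["CCCF", "CCCF", "CCCF", "TRCI"],
    "classification": ["C6775", "C6775", "C6775", "X6780"],
    "paycode":        ["CD", "CD", "OT", "CD"],
    "hours":          [8.0, 12.0, 4.0, 6.0],
    "dollars":        [232.0, 348.0, 174.0, 255.0],
})

# Equivalent of ACL's Summarize: one row per location / classification /
# pay code, with hours and dollars totaled.
summary = (raw.groupby(["location", "classification", "paycode"],
                       as_index=False)[["hours", "dollars"]].sum())
print(summary)
```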
With the data prepared, it should look something like this:

Table 1 – Data

[Image: sumif post pic 1]

The first column denotes the prison. The next is the classification of the correctional employees (e.g. sergeant, corporal, officer, etc.). The following column shows pay codes – we had 50 different pay codes in our data set. The next three are straightforward – dollars, hours, and counts. Lastly, we have a short description of the pay codes. For example, CD is career development/training.

After developing this table, I calculated the average for each row by dividing by the FTE in each classification at each prison. For example, there were 3,881.5 hours of CD and 184 FTE, yielding an average of 21 CD hours per staff member. Now that I have my data ready, I can start analyzing it. I want to know if there are differences between locations, classifications, and pay codes, and whether any of them are driving overtime.

I set up a table in Excel, shown below. CCCF stands for Coffee Creek Correctional Facility and TRCI is Two Rivers Correctional Institution. Pay code descriptions are above. Classifications range from officer (C6775) to captain (X6780).

Table 2 – PRF (Post Relief Factor)

[Image: sumif post pic 2]

I can now use this table within my SUMIFS function to pull average hours from the other table.
The SUMIFS function has three main parameters: Sum_range, Criteria_range, and Criteria.

[Image: sumif post pic 3]

Sum_range is the range of data you want summed. In this case, I wanted to pull average hours from my payroll data. You only have one sum_range, although you can have as many criteria as you want. I will have three criteria: I want the average to come from (1) the correctional institution, (2) the employee classification, and (3) the pay code.

Columns A, B, and C from Table 1 will each be a criteria range. I will use Table 2 as my criteria. Here is what the formula looks like. Note that I use a combination of absolute and relative references (our next post will delve into this in more detail). Absolute references “lock in” cell ranges in a formula so that when you drag the formula they do not change. Absolute references are denoted by a “$”.

SUMIFS function

[Image: sumif post pic 4]

Reading left to right, the function is asked to sum the averages calculated in Table 1. It looks in the first column for the prison acronym matching “CCCF”, highlighted in blue. Next, it searches the employee classifications in column B for “C6775”. Finally, it matches pay code “CD”, highlighted in purple.
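As an aside, the same three-criteria lookup can be scripted. In pandas, pivot_table builds the whole PRF grid in one call; here is a sketch with illustrative column names and stand-in values (only the 21 CD hours figure comes from the example above):

```python
import pandas as pd

# Stand-in for Table 1 (values illustrative).
tbl1 = pd.DataFrame({
    "prison":    ["CCCF", "CCCF", "TRCI", "TRCI"],
    "class":     ["C6775", "X6780", "C6775", "X6780"],
    "paycode":   ["CD", "CD", "CD", "OT"],
    "avg_hours": [21.0, 14.2, 18.7, 9.3],
})

# One cell per prison/classification/pay code: the same grid that dragging
# the SUMIFS formula across Table 2 produces.
prf = tbl1.pivot_table(index=["prison", "class"], columns="paycode",
                       values="avg_hours", aggfunc="sum")
print(prf)
```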

Once I have the one formula set up, I can drag it over and down and calculate over 500 different averages in a few seconds. After setting up one year of data, all you need to do is copy the tab and re-link it to the next year’s data, and you can compare year-over-year trends in minutes.

There are a few steps I’m skipping over, but the end result of these calculations looks something like this:

[Image: sumif post pic 5]

So what did we find in our audit? Overtime is not as big of a problem as people perceive it to be.

Most people think that overtime has to be more expensive because you are paying time and a half. What is often left out is the cost of leave time and other benefits, which can add up to close to 50% of salary, making the pay difference negligible. Furthermore, if you hire an officer to replace overtime, you must pay them for about 2,000 hours per year, whereas with overtime, you only pay for the hours you need. What is cheaper? $65,000 for a new officer or $25,000 for 500 hours of overtime?
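To put numbers on that trade-off, here is the back-of-the-envelope arithmetic, using the rate implied by the figures above (roughly $50 per overtime hour):

```python
officer_cost = 65_000      # rough annual cost of one new officer
ot_rate = 25_000 / 500     # about $50 per overtime hour, per the figures above

# Break-even point: overtime hours per year at which hiring becomes cheaper.
print(officer_cost / ot_rate)   # 1300.0 hours
```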

So if you pay only 500 hours of overtime per year in a given shift, it doesn’t make sense to hire an officer to cover that time, because they would be paid for hours not needed. Below is a great example.

Overtime at CCCF

[Image: sumif post pic 6]

You can see that overtime is quite varied. It peaks around hunting and fishing seasons, flu season, and the winter holidays. This is not that surprising, as more people call in sick during these times and someone needs to work overtime to replace them.

If CCCF hired an additional officer to work these hours, they would only reduce overtime by a small fraction. At best, CCCF could eliminate the overtime that falls between 0 and 8 hours. To eliminate all overtime, CCCF would need to hire 6 FTE on the graveyard shift alone, or 48 hours of coverage, which would cost vastly more than the overtime it replaces.

As we found out in this audit, sometimes your gut (i.e., that overtime is costly) is wrong.

[Image: ian]

Post prepared by Guest Blogger and OAD Senior Auditor Ian Green


Auditing How To: Document Sample Selections in ACL

Hello again, fellow data wonks and wonk wannabes!

Last time, we discussed random sampling in Excel and what factors you should consider when determining your sample size. (Hint: 30 is generally large enough, but not in all cases.)

One of the downfalls of Excel is the lack of an audit trail. In these examples, we will provide a high-tech and a low-tech way to document your sample selection process in detail. First up, ACL.

The High-Tech Method

I am working with fictional data below. As you can see, our population contains 36 counties. Make note of your population size when working in ACL, as this will be important later on. You can count a table by using the shortcut “CTRL + 3”.

[Image: ACL sampling pic 1]

Next, select the “Sampling” menu and click on “Sample Records”. This also has a shortcut: “CTRL + 9”.

[Image: ACL sampling pic 2]

Change “Sample Type” from “MUS” to “Record”. Then click on “Random” on the middle left of the interface. Enter the “Size” of the sample; I pulled a sample of 10. The “Seed” allows you to document and repeat a random sample. Any number will do – just pick the first one that comes to mind.

I know what you’re thinking. However, just because a sample is repeatable does not change the fact that it is random.

Enter the “Population” we recorded earlier, then define the name of the table you want the sample sent to.

[Image: ACL sampling pic 3]

There you have it: a random sample of 10 counties in Oregon, with a full log file and a repeatable methodology in case you are ever questioned about how you pulled your sample.
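If you want to see seed-driven repeatability for yourself, it is easy to demonstrate in a few lines of Python (an illustration of the concept, not ACL’s exact algorithm):

```python
import random

population = list(range(1, 37))   # 36 counties, numbered 1..36

random.seed(7)                    # the documented seed; any number will do
sample = random.sample(population, 10)
print(sample)

# The same seed reproduces the identical sample, which is what
# makes the selection documentable and repeatable.
random.seed(7)
assert random.sample(population, 10) == sample
```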

The Low-Tech Method

If you are still hung up on what a seed has to do with random sampling, the low-tech way will make it clear. Below we have a copy of a random number table. You can find these in the appendix of most statistics textbooks or via Google.

[Image: ACL sampling pic 5]

The “seed” tells you where to start on the table. If I have a seed of 1, we start at the 1st number, which also happens to be a 1. A seed of “3” starts at the 3rd number, which in this case is 4. This is what makes it repeatable. Our population was 36, so to pull a sample we will be looking at sequences of 2-digit numbers. I will use a seed of “3” and pull just three selections.

In the random number table to the right, I’ve crossed out the first two numbers since our seed was “3”.

[Image: ACL sampling pic 6]

Starting with the 3rd number, I looked at each 2-digit sequence. If the number fell between 01 and 36, it was a valid random selection and highlighted in green. If the number was above 36, I moved to the next sequence. Also, if repeats are not allowed in your sample, you would move to the next number as well (e.g. 11 would be my next selection, but it was already pulled, so I would skip over the repeat). Keep moving right and down until you have pulled the full sample.

In this case, my sample was 01, 11, and 20, or Baker, Gilliam, and Lane counties (shown below). Functionally, this manual low-tech process is identical to what ACL does.

[Image: ACL sampling pic 7]

You can apply the random number table approach to extremely large files. If you had 1,000,000 records, you would look at 7-digit sequences rather than the 2-digit sequences shown above.
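For the curious, the manual table-walk translates almost line for line into code. Here is a sketch in Python; the digit string is a stand-in, not the table pictured above:

```python
def table_sample(digits, seed, pop_size, n, width=2):
    # Skip seed - 1 digits, then read fixed-width sequences, keeping values
    # between 1 and pop_size and skipping repeats, until n are drawn.
    picks, pos = [], seed - 1
    while len(picks) < n and pos + width <= len(digits):
        value = int(digits[pos:pos + width])
        if 1 <= value <= pop_size and value not in picks:
            picks.append(value)
        pos += width
    return picks

print(table_sample("4144290111253620948307", seed=3, pop_size=36, n=3))
# -> [29, 1, 11] with this stand-in table
```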

And there we have it! Two useful methods for documenting sample selection.

If you are stuck on a project in ACL, Excel, or ArcGIS, please submit your topic suggestions for a future blog post.


So that’s how you do a random sample in Excel

We’ve all been there. The boss shows up and says, “I want you to select a random sample of files for the audit.” The boss leaves and you frantically begin searching for your old college textbooks.

Fear these technical challenges no longer. The Oregon Audits blog will be rolling out new posts covering practical and useful audit tools. Random sampling will be our first topic, but if you have any requests, please don’t hesitate to contact us.

What the heck is random sampling?

Random sampling is useful for gaining an understanding of a population without examining every file. Random selection also eliminates bias, because every “file” has an equal chance of being selected. One word of caution, though: if you are trying to look for outliers, you will need a large sample size.

That raises the question: how big of a sample do I need to take? The short answer: 30 is usually good. If it is a simple test and not the critical element of your finding, 30 should cover you almost every time.

The longer answer is: it depends. You need to consider your objectives, how confident you want to be about your results, how much margin of error is tolerable, and how big and varied the population is. More confidence requires a larger sample. A smaller margin of error also increases sample size. Populations that are less uniform (that have higher standard deviations) require larger samples too. And if this is a critical element of your finding, you need even more.
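For estimating a population average, the standard textbook formula rolls those factors into one line: n = (z × σ / E)², where z reflects your confidence level, σ the population’s standard deviation, and E the tolerable margin of error. A quick sketch with placeholder inputs:

```python
import math

z = 1.96       # 95% confidence
sigma = 12.0   # estimated population standard deviation (placeholder)
margin = 2.0   # tolerable margin of error, in the same units as the data

n = math.ceil((z * sigma / margin) ** 2)
print(n)       # 139 records for these inputs
```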

This is a handy online calculator for determining the sample size needed to estimate the average of a population. Older textbooks like this one are great office resources (this one is super easy to follow). Better yet, it sells for about $10, making it a steal.

Excel: The easy way to pull a random sample

If the population you are reviewing is not numbered, you will need to create an index number for each file.

Excel has a built-in random number generator. By using RANDBETWEEN() we can generate our sample. Enter “1” (or the lowest possible index number) for the bottom, and the largest possible index number for the top. In this scenario, I will use 1 and 100: =RANDBETWEEN(1,100).

[Image: how to 1]

Once you have your function looking like this:

[Image: how to 2]

copy the formula into the other cells. I am pulling a sample of 12, so I will drag the formula down 12 cells.

[Image: how to 3]

You will note right away that each time you change something on the sheet, the numbers change. So if you want to lock in a sample, you need to copy the cells with the RANDBETWEEN formula and paste them as “values”.

[Image: how to 4]

I prefer to paste over the cells I just copied.

Here’s the sample I got:

[Image: how to 5]

If you come across a duplicate number, you will need to add another row or replace the duplicate with a new RANDBETWEEN formula.
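If you would rather sidestep duplicates entirely, a scripted alternative samples without replacement in one step (an illustration in Python, not an Excel feature):

```python
import random

sample = random.sample(range(1, 101), 12)   # 12 unique picks from 1..100
print(sorted(sample))
```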

Pitfall: Weak audit trail

One of the drawbacks of Excel is that the audit trail is weak. The only documentation you have is a set of numbers in a spreadsheet, which you could just as easily have entered manually. If you are working on a piece of evidence that is critical, you will probably want more documentation of how you arrived at your sample.

Our next post will cover how to document a random sample using technology such as ACL and how to document it the low-tech way using a “random number table”.
