Question

Testing against Benford's Law

Suppose that you measure some naturally-occurring phenomenon, such as the land area of cities or lakes, or the price of stocks on the stock market. You can plot a histogram of the first digits of your measurements, showing how frequent each first digit (1-9) is.

Your measurements were made in some arbitrary units, such as square miles or dollars. What if you made the measurements in some other units, such as acres or euros? You would not expect the shape of the histogram to differ just because you changed the units of measurement, because the units are arbitrary and unrelated to the underlying phenomenon. If the shape of the histogram did change, that could indicate that the data were fraudulent. This approach has been used to detect fraud, especially in accounting and economics but also in science.
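The unit-change argument above can be checked numerically. The following is a minimal sketch, not part of the assignment: it assumes synthetic "areas" spread evenly on a log scale over six orders of magnitude (an illustrative choice, not real data), converts them from square miles to acres, and compares the two first-digit histograms. The helper names `first_digit` and `digit_frequencies` are hypothetical.

```python
import collections
import random


def first_digit(x):
    """Leading nonzero digit of a positive number, read off scientific notation."""
    return int(f"{x:e}"[0])


def digit_frequencies(values):
    """Relative frequency of each first digit 1..9 in values."""
    counts = collections.Counter(first_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]


random.seed(0)
# Assumed synthetic data: areas in square miles, log-uniform over six decades.
square_miles = [10 ** random.uniform(0, 6) for _ in range(100_000)]
acres = [640 * v for v in square_miles]  # 1 square mile = 640 acres

freq_sq_miles = digit_frequencies(square_miles)
freq_acres = digit_frequencies(acres)
max_gap = max(abs(a - b) for a, b in zip(freq_sq_miles, freq_acres))

for d, (a, b) in enumerate(zip(freq_sq_miles, freq_acres), start=1):
    print(f"{d}: {a:.3f} (sq mi)  {b:.3f} (acres)")
print(f"largest difference between the two histograms: {max_gap:.4f}")
```

For data like this, the two columns should agree closely; a large gap after a change of units would be the kind of warning sign described above.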

Data are called scale-invariant if measuring them in any scale (say, meters vs. feet vs. inches) yields similar results, such as the same histogram of first digits. Many natural processes produce scale-invariant data. Benford's Law states the remarkable fact that there is only one possible histogram that results from all of these processes! Let $P(d)$ be the probability of the given digit $d \in \{1, 2, \ldots, 9\}$ being the first one in a measurement. Then we have

$P(d) = \log_{10}(d+1) - \log_{10}(d)$
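As a quick sanity check on Benford's formula, the difference of logarithms can be rewritten as a single logarithm, and the nine probabilities telescope to exactly 1:

```latex
P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\!\left(1 + \frac{1}{d}\right),
\qquad
\sum_{d=1}^{9} P(d) = \log_{10}(10) - \log_{10}(1) = 1 .
```

In particular, $P(1) = \log_{10} 2 \approx 0.301$ and $P(9) = 1 - \log_{10} 9 \approx 0.046$, so a leading 1 is expected roughly six to seven times as often as a leading 9.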

NOTE: Benford's Law only holds when the data has no natural limit or cutoff. For example, it would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0) nor for people's height in centimeters (where almost every value would start with 1 or 2).
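The height example in the note can be illustrated with a small sketch. The data here are an assumption made purely for illustration: heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, not real measurements.

```python
import collections
import random

random.seed(0)
# Assumed synthetic data: adult heights in centimeters, roughly normal with
# mean 170 and standard deviation 10 (illustrative values only).
heights_cm = [random.gauss(170, 10) for _ in range(10_000)]

# The first character of scientific notation is the leading digit.
counts = collections.Counter(int(f"{h:e}"[0]) for h in heights_cm)
share_of_ones = counts[1] / len(heights_cm)

print(f"share of heights starting with 1: {share_of_ones:.3f}")
print("Benford's Law would predict about 0.301")
```

Because the values are confined to a narrow range, nearly all of them start with 1, far from the roughly 30% that Benford's Law predicts.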

(A) Below is the Python code that parses the population data, plots the distribution of their first digits side-by-side with the "theoretical" distribution, and performs hypothesis testing (using a chi-squared test) to determine how well the observed first-digit frequencies conform to the theoretical distribution.

But the p-value computed by the supplied code is WRONG! The theoretical distribution in this case should be governed by Benford's Law, but the current implementation in the benford() function uses the uniform distribution.

Main question: modify the benford() function so it reflects Benford's Law instead.

(Next question if possible: Run your modified code and report the p-value. Is it close to 1 or 0? What does that tell you?)

def benford():
    #
    # This method returns an array of 9 entries, representing
    # the probabilities of having the first digit to be 1, 2, ..., 9.
    #
    # WARNING: The following just represents the uniform distribution; please
    # replace it with your own implementation of Benford!
    #
    return [float(1) / len(digits) for d in digits]

----------------------------------------------------------------------

# Imports needed by the code below (assuming `stats` is scipy.stats and
# `plt` is matplotlib.pyplot):
import collections
import csv

import matplotlib.pyplot as plt
from scipy import stats

digits = range(1, 10)  # for first-digit distro, we don't consider 0

with open('SUB-EST2009_ALL.csv') as csvfile:
    numbers = (int(row['POPCENSUS_2000'])
               for row in csv.DictReader(csvfile)
               if row['POPCENSUS_2000'].isdigit())
    firstdigits = (int(str(x)[0]) for x in numbers)
    # The generators are lazy, so count while the file is still open:
    counts = collections.Counter(firstdigits)

# Perform chi-squared test:
freqs = [counts[d] for d in digits]
total = sum(counts[d] for d in digits)
theoretical_freqs = [f * total for f in benford()]
chisq, pvalue = stats.chisquare(freqs, theoretical_freqs)
print('chi-squared test p-value: ' + str(pvalue))

# Plot two distros side-by-side:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# The third parameter below is the width of the bar, set
# such that we can squeeze another set of bars in:
rects1 = ax.bar(digits, theoretical_freqs, 0.4, color='g')
# In the following, adding the bar width to the x coordinates
# positions this set of bars side-by-side with the last:
rects2 = ax.bar([d + 0.4 for d in digits], freqs, 0.4, color='r')
ax.legend((rects1[0], rects2[0]), ('Theoretical', 'Observed'))
ax.set_title('chi-squared test p-value: ' + str(pvalue))
plt.show()
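One possible way to approach the main question in (A) is sketched below. It keeps the name `benford()` and the `digits` range from the supplied code and swaps the uniform probabilities for $P(d) = \log_{10}(d+1) - \log_{10}(d)$. This is a sketch under that reading of the assignment, not an official solution.

```python
import math

digits = range(1, 10)  # same definition as in the supplied code


def benford():
    """Return [P(1), ..., P(9)] with P(d) = log10(d + 1) - log10(d)."""
    return [math.log10(d + 1) - math.log10(d) for d in digits]


probs = benford()
for d, p in zip(digits, probs):
    print(f"{d}: {p:.4f}")
print("sum:", round(sum(probs), 12))
```

Dropping this function into the supplied script in place of the uniform version leaves the rest of the code unchanged, since `theoretical_freqs` is still built from whatever `benford()` returns. When reading the result, a p-value near 0 means the observed first-digit counts are very unlikely under the theoretical distribution, while a p-value that is not small means the test finds no evidence against it.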

(B) Main question: modify the code further to analyze data from the 2009 Iranian election in election-iran-2009.csv. The numbers of interest in the .csv file are the vote counts for candidates Ahmadinejad, Rezai, Karrubi, and Mousavi. Note that these numbers have commas in them. You might find Python's csv.DictReader useful.

(Next question if possible: Consider the first-digit distribution of all candidates' vote counts, as well as that of each candidate's vote counts. Compare each of these distributions to Benford's Law. Which distribution gives you the lowest p-value? How does it compare with the magic p-value of 0.05, which people often use to "reject" the hypothesis that data follows the theoretical distribution?)

Sample election-iran-2009.csv:

Region,Ahmadinejad,% ,Rezai,%,Karrubi,%,Mousavi,%,Total votes,Invalid votes,Valid votes,Eligible voters,"Turnout, %"
East Azerbaijan,"1,131,111",56.75,"16,920",0.85,"7,246",0.36,"837,858",42.04,"2,010,340","17,205","1,993,135","2,461,553",80.97
West Azerbaijan,"623,946",47.48,"12,199",0.93,"21,609",1.64,"656,508",49.95,"1,334,356","20,094","1,314,262","1,883,144",69.79
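A rough sketch of how (B) might be approached follows. It reads the two sample rows shown above from a string so that it runs without the real file; with only two regions the p-values it prints are not meaningful, and the full `election-iran-2009.csv` would be opened with `open(...)` instead. The helper names (`first_digits_by_candidate`, `chisquare_pvalue`, `benford_pvalue`) are hypothetical. To stay dependency-free, the p-value uses the closed form of the chi-squared survival function for 8 degrees of freedom; the `stats.chisquare(freqs, theoretical_freqs)` call from the supplied code could be used in its place.

```python
import collections
import csv
import io
import math

# Two data rows copied from the sample above; the real file has more regions.
SAMPLE = '''Region,Ahmadinejad,% ,Rezai,%,Karrubi,%,Mousavi,%,Total votes,Invalid votes,Valid votes,Eligible voters,"Turnout, %"
East Azerbaijan,"1,131,111",56.75,"16,920",0.85,"7,246",0.36,"837,858",42.04,"2,010,340","17,205","1,993,135","2,461,553",80.97
West Azerbaijan,"623,946",47.48,"12,199",0.93,"21,609",1.64,"656,508",49.95,"1,334,356","20,094","1,314,262","1,883,144",69.79
'''

CANDIDATES = ['Ahmadinejad', 'Rezai', 'Karrubi', 'Mousavi']
DIGITS = range(1, 10)


def benford():
    """Benford probabilities P(1), ..., P(9)."""
    return [math.log10(d + 1) - math.log10(d) for d in DIGITS]


def first_digits_by_candidate(csvfile):
    """Map each candidate to the first digits of their vote counts."""
    result = {name: [] for name in CANDIDATES}
    for row in csv.DictReader(csvfile):
        for name in CANDIDATES:
            value = row[name].replace(',', '').strip()  # "1,131,111" -> "1131111"
            if value.isdigit() and int(value) > 0:
                result[name].append(int(str(int(value))[0]))
    return result


def chisquare_pvalue(observed, expected):
    """Chi-squared goodness-of-fit p-value for 9 categories (8 d.o.f.)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    half = stat / 2
    # Survival function for 8 d.o.f.: exp(-x/2) * sum_{i=0..3} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))


def benford_pvalue(first_digits):
    """Compare observed first-digit counts with Benford's Law."""
    counts = collections.Counter(first_digits)
    freqs = [counts[d] for d in DIGITS]
    total = sum(freqs)
    return chisquare_pvalue(freqs, [p * total for p in benford()])


by_candidate = first_digits_by_candidate(io.StringIO(SAMPLE))
all_digits = [d for ds in by_candidate.values() for d in ds]

print('all candidates:', benford_pvalue(all_digits))
for name, ds in by_candidate.items():
    print(name, ds, benford_pvalue(ds))
```

Running the same functions on the full file gives one p-value for all candidates combined and one per candidate, which can then be compared with each other and with the 0.05 threshold mentioned in the question.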
