
Question

Posted on Sep 09, 2024

Testing against Benford's Law

Suppose that you measure some naturally-occurring phenomenon, such as the land area of cities or lakes, or the price of stocks on the stock market. You can plot a histogram of the first digits of your measurements, showing how frequent each first digit (1-9) is.

Your measurements were made in some arbitrary units, such as square miles or dollars. What if you made the measurements in some other units, such as acres or euros? You would not expect the shape of the histogram to differ just because you changed the units of measurement, because the units are arbitrary and unrelated to the underlying phenomenon. If the shape of the histogram did change, that could indicate that the data were fraudulent. This approach has been used to detect fraud, especially in accounting and economics but also in science.
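The unit-change argument can be tried on synthetic data. The following is a minimal sketch, not part of the assignment: it assumes made-up "areas" drawn log-uniformly over six orders of magnitude, and the helper names `first_digit` and `digit_shares` are hypothetical. It converts the same values from square miles to acres (1 square mile = 640 acres) and compares the first-digit shares.

```python
import random
from collections import Counter


def first_digit(x):
    """Return the leading digit (1-9) of a positive number."""
    # Scientific notation puts the leading digit first, e.g. '6.4e+02'.
    return int(f"{x:.15e}"[0])


def digit_shares(values):
    """Fraction of values whose first digit is 1, 2, ..., 9."""
    counts = Counter(first_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumption for illustration only: synthetic areas, log-uniform over six decades.
areas_sq_miles = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
areas_acres = [a * 640 for a in areas_sq_miles]

shares_miles = digit_shares(areas_sq_miles)
shares_acres = digit_shares(areas_acres)
for d, (m, a) in enumerate(zip(shares_miles, shares_acres), start=1):
    print(f"{d}: {m:.3f} (sq mi)  {a:.3f} (acres)")
```

For data like this, the two columns should agree closely even though every individual value changed. Data confined to a narrow range would generally not behave this way.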

Data are called scale-invariant if measuring them in any scale (say, meters vs. feet vs. inches) yields similar results, such as the same histogram of first digits. Many natural processes produce scale-invariant data. Benford's Law states the remarkable fact that there is only one possible histogram that results from all of these processes! Let $P(d)$ be the probability of the given digit $d \in \{1, 2, \ldots, 9\}$ being the first one in a measurement. Then, we have

$$P(d) = \log_{10}(d+1) - \log_{10}(d)$$
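As a quick numeric check of the first-digit formula P(d) = log10(d+1) - log10(d), the sketch below tabulates the nine probabilities. The function name `benford_probabilities` is a hypothetical choice for illustration and is not part of the supplied code.

```python
import math


def benford_probabilities():
    """Probabilities that the first digit is 1, 2, ..., 9 under Benford's Law."""
    return [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]


probs = benford_probabilities()
for d, p in enumerate(probs, start=1):
    print(f"P({d}) = {p:.3f}")
print("sum =", round(sum(probs), 12))
```

The probabilities fall from about 30.1% for the digit 1 to about 4.6% for the digit 9. The sum telescopes to log10(10) - log10(1) = 1, so the list is a valid distribution.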

NOTE: Benford's law only holds when the data has no natural limit or cutoff. For example, it would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0) nor for people's height in centimeters (where almost every value would start with 1 or 2).

(A) Below is the Python code that parses the population data, plots the distribution of their first digits side-by-side with the "theoretical" distribution, and performs hypothesis testing (using a $\chi^2$ test) to measure how well the observed first-digit frequencies conform to the theoretical distribution.

But the p-value computed by the supplied code is WRONG! The theoretical distribution in this case should be governed by Benford's Law, but the current implementation in the benford() function uses the uniform distribution.

Main question: modify the benford() function so it reflects Benford's Law instead.

(Next question if possible: Run your modified code and report the p-value. Is it close to 1 or 0? What does that tell you?)

import collections
import csv

import matplotlib.pyplot as plt
from scipy import stats


def benford():
    #
    # This method returns an array of 9 entries, representing
    # the probabilities of having the first digit to be 1, 2, ..., 9.
    #
    # WARNING: The following just represents the uniform distribution; please replace
    # it with your own implementation of Benford!
    #
    return [ float(1)/len(digits) for d in digits ]

----------------------------------------------------------------------

digits = range(1, 10) # for first-digit distro, we don't consider 0

with open('SUB-EST2009_ALL.csv') as csvfile:
    numbers = (int(row['POPCENSUS_2000'])
               for row in csv.DictReader(csvfile)
               if row['POPCENSUS_2000'].isdigit())
    firstdigits = (int(str(x)[0]) for x in numbers)
    counts = collections.Counter(firstdigits)

# Perform chi-squared test:
freqs = [counts[d] for d in digits]
total = sum(counts[d] for d in digits)
theoretical_freqs = [ f * total for f in benford() ]
chisq, pvalue = stats.chisquare(freqs, theoretical_freqs)
print('chi-squared test p-value: ' + str(pvalue))

# Plot two distros side-by-side:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# The third parameter below is the width of the bar, set
# such that we can squeeze another set of bars in:
rects1 = ax.bar(digits, theoretical_freqs, 0.4, color='g')
# In the following, adding the bar width to the x coordinates
# positions this set of bars side-by-side with the last:
rects2 = ax.bar([d + 0.4 for d in digits], freqs, 0.4, color='r')
ax.legend((rects1[0], rects2[0]), ('Theoretical', 'Observed'))
ax.set_title('chi-squared test p-value: ' + str(pvalue))
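To make the hypothesis test less of a black box, here is a rough standard-library sketch of what a chi-squared goodness-of-fit test computes for nine categories (eight degrees of freedom, the default for `stats.chisquare` with nine bins). The function name `chi_square_pvalue_df8` and the counts below are made up for illustration; the closed-form tail used here only applies to an even number of degrees of freedom.

```python
import math


def chi_square_pvalue_df8(observed, expected):
    """Chi-squared statistic and p-value for exactly 9 categories (df = 8)."""
    if len(observed) != 9 or len(expected) != 9:
        raise ValueError("this sketch only handles 9 categories")
    # Statistic: squared deviations, each scaled by the expected count.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    half = stat / 2
    # For df = 2k the upper tail is exp(-x/2) * sum_{i<k} (x/2)^i / i!; here k = 4.
    pvalue = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return stat, pvalue


benford_probs = [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]
observed = [100] * 9  # made-up counts: perfectly uniform first digits
expected = [p * sum(observed) for p in benford_probs]
stat, pvalue = chi_square_pvalue_df8(observed, expected)
print(f"chi-squared = {stat:.1f}, p-value = {pvalue:.3g}")
```

A p-value near 0 says that deviations this large would be very unlikely if the data really followed the theoretical distribution. A p-value near 1 says the observed counts give no evidence against it.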

(B) Main question: modify the code further to analyze data from the 2009 Iranian election in election-iran-2009.csv. The numbers of interest in the .csv file are the vote counts for candidates Ahmadinejad, Rezai, Karrubi, and Mousavi. Note that these numbers have commas in them. You might find Python's csv.DictReader useful.
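One way to handle the comma-formatted vote counts is sketched below, assuming the column names shown in the sample file. The helper `first_digit_counts` is a hypothetical name, and the two sample rows are embedded through `io.StringIO` so the snippet runs without the real file.

```python
import csv
import io
from collections import Counter

# The header and the two rows from the sample election-iran-2009.csv.
SAMPLE = '''Region,Ahmadinejad,% ,Rezai,%,Karrubi,%,Mousavi,%,Total votes,Invalid votes,Valid votes,Eligible voters,"Turnout, %"
East Azerbaijan,"1,131,111",56.75,"16,920",0.85,"7,246",0.36,"837,858",42.04,"2,010,340","17,205","1,993,135","2,461,553",80.97
West Azerbaijan,"623,946",47.48,"12,199",0.93,"21,609",1.64,"656,508",49.95,"1,334,356","20,094","1,314,262","1,883,144",69.79
'''

CANDIDATES = ["Ahmadinejad", "Rezai", "Karrubi", "Mousavi"]


def first_digit_counts(csvfile, columns):
    """Count first digits of the vote totals in each named column."""
    counts = {name: Counter() for name in columns}
    for row in csv.DictReader(csvfile):
        for name in columns:
            text = row[name].replace(",", "")  # "1,131,111" -> "1131111"
            if text.isdigit() and int(text) > 0:
                counts[name][int(str(int(text))[0])] += 1
    return counts


counts = first_digit_counts(io.StringIO(SAMPLE), CANDIDATES)
pooled = sum(counts.values(), Counter())  # all candidates combined
for name in CANDIDATES:
    print(name, dict(counts[name]))
print("all candidates", dict(pooled))
```

For the real data, `io.StringIO(SAMPLE)` would be replaced by a file opened with `open('election-iran-2009.csv')`. The per-candidate and pooled counters can then be fed to the same chi-squared test as in part (A).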

(Next question if possible: Consider the first-digit distribution of all candidates' vote counts, as well as that of each candidate's vote counts. Compare each of these distributions to Benford's Law. Which distribution gives you the lowest p-value? How does it compare with the magic p-value of 0.05, which people often use to "reject" the hypothesis that data follows the theoretical distribution?)

Sample election-iran-2009.csv:

Region,Ahmadinejad,% ,Rezai,%,Karrubi,%,Mousavi,%,Total votes,Invalid votes,Valid votes,Eligible voters,"Turnout, %"
East Azerbaijan,"1,131,111",56.75,"16,920",0.85,"7,246",0.36,"837,858",42.04,"2,010,340","17,205","1,993,135","2,461,553",80.97
West Azerbaijan,"623,946",47.48,"12,199",0.93,"21,609",1.64,"656,508",49.95,"1,334,356","20,094","1,314,262","1,883,144",69.79

