More on Benford's Law
Jul 30th, 2007 by JTJ

We've long been intrigued with Benford's Law and its potential for Analytic Journalism.  Today we ran across a new post by Charley Kyd that both explains the Law and presents some clear formulas for its application.

An Excel 97-2003 Tutorial:

Use Benford's Law with Excel

To Improve Business Planning

Benford's Law addresses an amazing characteristic of data. Not only can the formula help to identify fraud, it could help you to improve your budgets and forecasts.

by Charley Kyd

July, 2007

(Email Comments)

(Follow this link for the Excel 2007 version.)

Unless you're a public accountant, you probably haven't experimented with Benford's Law.

Auditors sometimes use this fascinating statistical insight to uncover fraudulent accounting data. But it might reveal a useful strategy for investing in the stock market. And it might help you to improve the accuracy of your budgets and forecasts.

This article will explain Benford's Law, show you how to calculate it with Excel, and suggest ways that you could put it to good use.

From a hands-on-Excel point of view, the article describes new uses for the SUMPRODUCT function and discusses the use of local and global range names.  [Read more…]
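Kyd's tutorial does the work in Excel, but the law itself fits in one line of math: the probability that a number's first digit is d is log10(1 + 1/d). Here is a minimal sketch in Python (ours, not Kyd's) that computes the expected frequencies and compares them with the actual first digits of a data set:

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freqs(values):
    """Observed frequency of each leading digit (1-9) in a list of nonzero numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
# Powers of 2 are a classic Benford-conforming sequence.
observed = first_digit_freqs([2 ** n for n in range(1, 200)])
for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Run it on a column of accounting figures instead of powers of 2; a large gap between the two columns is the red flag auditors look for.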



The Beauty of Statistics
Jul 11th, 2007 by JTJ

FYI: From the O'Reilly Radar

Unveiling the Beauty of Statistics

Posted: 11 Jul 2007 03:01 AM CDT

By Jesse Robbins

I presented last week at the OECD World Forum in Istanbul along with Professor Hans Rosling, Mike Arrington, John Gage and teams from MappingWorlds, Swivel (disclosure: I am an adviser to Swivel) and Many Eyes. We were the “Web2.0 Delegation” and it was an incredible experience.

The Istanbul Declaration signed at the conference calls for governments to make their statistical data freely available online as a “public good.” The declaration also calls for new measures of happiness and well-being, going beyond just economic output and GDP. This requires the creation of new tools, which the OECD envisions will be a “wiki for progress.” Expect to hear more about these initiatives soon.

This data, combined with new tools like Swivel and MappingWorlds, is powerful. Previously this information was hard to acquire and the tools to analyze it were expensive and hard to use, which limited its usefulness. Now regular people can access, visualize and discuss this data, creating an environment where knowledge can be shared and explored.

H.G. Wells predicted that “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read or write.” Proponents of specific public policies often use statistics to support their view. They have the ability to select the data to fit with the policy. Democratization of statistics allows citizens to see the data that doesn't fit the policy, giving the public the power to challenge policymakers with new interpretations.

I highly recommend you watch Professor Rosling's exceptional summary of these exciting changes (where I got the title for this post), as well as his talks at TED.



NYT needs to install a "math checker" on every copy editor's desk
May 27th, 2007 by JTJ

This weekend, friend-of-the-IAJ Joe Traub sent the following to the editor of the New York Times.  Here's the story Joe is talking about: “White House….

To the Editor:

The headline on page 1 on May 26 states
“White House Said to Debate '08 Cut in Troops by 50%”
The article reports a possible reduction to 100,000 troops
from 146,000. That's 31.5%, not 50%. NPR's Morning Edition
picked up the story from the NYT and also reported 50%.

Joseph F. Traub
The writer is a Professor of Computer Science at Columbia University

The headline error is bad enough (it's only in the hed, not in the story) — and should be a huge embarrassment to the NYT.  But the error gets compounded because, while the Times no longer sets the agenda for the national discussion, it is still thought of (by most?) as the paper of record.  Consequently, as other colleagues have pointed out, the reduction percentage gets picked up by other journalists who don't bother to do the math (or who cannot do the math).
See, for example:
* CBS News — “Troop Retreat In '08?” (This video has a shot of the NYT story even though the percentage is not mentioned.  Could it be that the TV folks don't think viewers can do the arithmetic?)
(NB: We could not yet find on the NPR site the transcript of the radio story that picked up the 50 percent error.  But run a Google search with “cut in Troops by 50%” and note the huge number of bloggers who also went with the story without doing the math.)

Colleague Steve Doig has queried the reporter of the piece, David Sanger, asking if the mistake is that of the NYT or the White House.  No answer yet received, but Doig later commented: “Sanger's story did talk about reducing brigades from 20 to 10. That's how they'll justify the '50% reduction' headline, I guess, despite the clear reference higher up to cutting 146,000 troops to 100,000.”
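For the record, the arithmetic the copy desk skipped takes three lines:

```python
# Reduction from 146,000 troops to 100,000 troops:
troop_cut = (146_000 - 100_000) / 146_000
print(f"{troop_cut:.1%}")  # 31.5%, nowhere near 50%

# The 50% figure only works for brigades, cut from 20 to 10:
brigade_cut = (20 - 10) / 20
print(f"{brigade_cut:.0%}")  # 50%
```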

Either way, it is a serious blunder of a fundamental sort on an issue most grave.  It should have been caught, but then most journalists are WORD people and only word people, we guess.

We would also point out the illogical construction that the NYT uses consistently in relaying statistical change over time.  To wit: “… could lower troop levels by the midst of the 2008 presidential election to roughly 100,000, from about 146,000…”  We wince. 

English is read from left to right.  Most English calendars and horizontal timelines are read from left to right.  When writing about statistical change, the same convention should be followed: oldest dates and data precede newest or future dates and data.  Therefore, this would best be written: “…could lower troop levels from about 146,000 to roughly 100,000 by the midst of the 2008 presidential election.”

Hey, bunky, you say you need a story for tomorrow, and the well is dry
Jan 2nd, 2007 by JTJ

No story?  Then check out Swivel, a web site rich with data — and the display of data — that you didn't know about and which is pregnant with possibilities for a good news feature.  And often a news feature that could be localized.

Here, for example, is a posting from the SECRECY REPORT CARD 2005 illustrating the changing trends in the classification and declassification of U.S. government data.  (You can probably guess the direction of the curves.)

Spotlight: “What is the US Government Not Telling Us?”  The number of classified documents is steadily increasing, while the number of pages being declassified is dwindling.  (These data were uploaded by mcroydon.)

The Quick and the Dead
Nov 9th, 2006 by JTJ

Paul Parker, of the Providence (Rhode Island) Journal, is the Quick and an impressive list of folks on the state's voter registration rolls are the Dead this week.  Below is a note Parker posted to the NICAR-L listserv.  The great thing about this is the recipe Parker provides for an analytic journalist's cookbook.  Said he:

Nothing new or innovative, but we ran a dead voters story today, and
it's getting tons of buzz. I would recommend — no, URGE — everyone on
the list do the same for your area.

Here's the link:

I know it's CAR101, but I'll outline how we did it (which is also
explained in the story):

1. Get your state's central voter registration database.
2. Get your state slice of the Social Security Administration's Death
Master File from IRE/NICAR.
3. Run a match on First Name, Last Name and Date of Birth.
4. Exclude matches where middle initials conflict. (Allow P=PETER or
P=NULL, but not P=G.)
5. Calculate a per capita rate for each city/town by dividing the number
of dead people by the total registered.
6. Interview the biggest offenders about why they're the biggest offenders.

This was so easy, and now everyone at the paper thinks I'm some sort of
journalism deity. (And the voter registration people called to ask,
“Where do I get a copy of that Social Security list.”)

As for the possibility of false positives, we pointed this out in the
story, which I think sufficed because the odds are low enough. I also
hand checked a few against our obituary archives.

Paul Parker
The Providence Journal
75 Fountain Street
Providence, RI 02902
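For those who want to try Parker's recipe at home, steps 3 through 5 reduce to a few lines of code. A sketch in Python (the field names first, last, dob, middle and city are our hypothetical stand-ins for whatever your state's files actually use):

```python
def middle_ok(a, b):
    """Step 4: allow P = PETER or P = missing, but not P = G."""
    if not a or not b:
        return True
    return a[0] == b[0]

def match_dead_voters(voters, deaths):
    """Steps 3-4: match on first name, last name and date of birth,
    then drop matches whose middle initials conflict."""
    dead_index = {}
    for d in deaths:
        dead_index.setdefault((d["first"], d["last"], d["dob"]), []).append(d)
    hits = []
    for v in voters:
        for d in dead_index.get((v["first"], v["last"], v["dob"]), []):
            if middle_ok(v.get("middle", ""), d.get("middle", "")):
                hits.append(v)
                break
    return hits

def rate_by_city(voters, hits):
    """Step 5: dead registrants per registered voter, by city/town."""
    totals, dead = {}, {}
    for v in voters:
        totals[v["city"]] = totals.get(v["city"], 0) + 1
    for h in hits:
        dead[h["city"]] = dead.get(h["city"], 0) + 1
    return {c: dead.get(c, 0) / n for c, n in totals.items()}

# Toy example: one registrant matches the death file; a second
# near-match is excluded because the middle initials conflict.
voters = [
    {"first": "JOHN", "last": "SMITH", "dob": "1920-01-01", "middle": "P", "city": "Providence"},
    {"first": "JOHN", "last": "SMITH", "dob": "1920-01-01", "middle": "G", "city": "Providence"},
    {"first": "MARY", "last": "JONES", "dob": "1950-05-05", "middle": "", "city": "Warwick"},
]
deaths = [{"first": "JOHN", "last": "SMITH", "dob": "1920-01-01", "middle": "PETER"}]
hits = match_dead_voters(voters, deaths)
rates = rate_by_city(voters, hits)
```

Step 6 — interviewing the biggest offenders — remains, as always, manual labor.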

Then David Heath, of the Seattle Times, layered in his experience.  Said he:

We did a dead-voter story last year after a squeaker of a governor's race. The story looked for dead people actually voting. At first, we were surprised by the number of matches. But very few of them withstood scrutiny. Matching a name and a birthdate will get you lots of false matches. You really need to include address, which you can do in our state where the death-certificate database is public.

We then went to the county election board and got the actual page voters signed when they voted. We even looked at absentee ballots. What we discovered were a lot of cases where a vote was recorded for a person because someone else accidentally signed the wrong line on the page — John R. Smith signing on John P. Smith's line, for example. Or cases where the person scanning the data with a bar-code reader into the database missed and scanned the wrong line. We also found cases where parents and children had the same name; the parent died but the son or daughter was mistakenly scrubbed from the registry.

We did find a few cases of dead people voting. Usually it was a recent death and someone in the family turned in an absentee ballot and forged the signature. But you have to be careful that a story about dead voters isn't really a story about dirty data.

David Heath
The Seattle Times

Teasing out attitudes from text
Oct 5th, 2006 by JTJ

Eric Lipton has a piece in Wednesday's (4 Oct. 2006) NYTimes about some “new” research efforts to come up with software “that would let the [U.S.] government monitor negative opinions of the United States or its leaders in newspapers and other publications overseas.”  (See “Software Being Developed to Monitor Opinions of U.S.”)  Surely this is an interesting problem, and one made especially difficult when the translation factor kicks in.

This is not, however, the first attempt to gin up such software.  We have long admired the work done some years ago at the Pacific Northwest National Laboratory on the ThemeRiver™ visualization.

It “…helps users identify time-related patterns, trends, and relationships across a large collection of documents. The themes in the collection are represented by a 'river' that flows left to right through time. The river widens or narrows to depict changes in the collective strength of selected themes in the underlying documents. Individual themes are represented as colored 'currents' flowing within the river. The theme currents narrow or widen to indicate changes in individual theme strength at any point in time.”

Status: An interactive proof-of-concept prototype has been developed, and a QuickTime video about ThemeRiver (20MB) is available for download.

We hope the PNNL will continue by giving us more of this intriguing tool.
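The layout behind a ThemeRiver-style display is simple to sketch: stack the per-theme strengths at each time step and center the whole band on the horizontal axis, so each theme becomes a ribbon whose thickness is its strength at that moment. A toy version in Python (our assumption of the simplest symmetric-baseline variant, not PNNL's actual algorithm):

```python
def river_layers(themes):
    """Given per-theme strength series (lists of equal length), return
    (bottom, top) boundary series for each theme's ribbon, stacked and
    centered so the river is symmetric about y = 0."""
    n = len(next(iter(themes.values())))
    totals = [sum(series[t] for series in themes.values()) for t in range(n)]
    baseline = [-tot / 2 for tot in totals]  # center the band on the axis
    layers = {}
    running = baseline[:]
    for name, series in themes.items():
        bottom = running[:]
        running = [b + s for b, s in zip(bottom, series)]
        layers[name] = (bottom, running[:])
    return layers

# Two themes tracked over three time steps.
themes = {"war": [3, 5, 2], "economy": [1, 2, 4]}
layers = river_layers(themes)
```

Feeding the resulting boundaries to any area-fill plotting routine gives the “currents within a river” effect the PNNL description talks about.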

Tracking the bucks all the way to court
Oct 2nd, 2006 by JTJ

Another unique investigation by The New York Times gets A1 play in this Sunday's edition (1 Oct. 2006) under the hed “Campaign Cash Mirrors a High Court's Rulings.”  Adam Liptak and Janet Roberts (who probably did the heavy lifting on the data analysis) took a long-term look at who contributed to the campaigns of Ohio's Supreme Court justices.  It ain't a pretty picture if one believes the justices should be above lining their own pockets, whether it's a campaign fund or otherwise.

In any event, there seems to be a clear correlation between contributions — and the sources — and the outcome of too many cases.  A sidebar, “Case Studies: West Virginia and Illinois,” would suggest there is much to be harvested by reporters in other states.

There is, thankfully, a fine description of how the data for the study was collected and analyzed.  See “How Information Was Collected.”

There are two accompanying infographics; one (“Ruling on Contributors' Cases”) is much more informative than the other (“While the Case Is Being Heard, Money Rolls In”), which is a good, but confusing, attempt to illustrate difficult concepts and relationships.

At the end of the day, though, we are grateful for the investigation, data crunching and stories.

Statistically speaking….
Sep 20th, 2006 by Tom Johnson

Any discipline always has subsets of argument, typically about definitions, methodologies, process or significance.  Statistics, of course, is no different.  Below is an interesting article from the Washington Monthly about what constitutes statistical significance.  The article is OK, but the commentary below it is even better.  See:

LIES, DAMN LIES, AND…. Via Kieran Healy, here's something way off the beaten path: a new paper by Alan Gerber and Neil Malhotra titled “Can political science literatures be believed? A study of publication bias in the APSR and the AJPS.” It is, at first glance, just what it says it is: a study of publication bias, the tendency of academic journals to publish studies that find positive results but not to publish studies that fail to find results. The reason this is a problem is that it makes positive results look more positive than they really are. If two researchers do a study, and one finds a significant result (say, tall people earn more money than short people) while the other finds nothing, seeing both studies will make you skeptical of the first paper's result. But if the only paper you see is the first one, you'll probably think there's something to it.

The chart on the right shows Gerber and Malhotra's basic result. In statistics jargon, a significant result is anything with a “z-score” higher than 1.96, and if journals accepted articles based solely on the quality of the work, with no regard to z-scores, you'd expect the z-scores of studies to resemble a bell curve. But that's not what Gerber and Malhotra found: just below a z-score of 1.96 there are far fewer studies than you'd expect. Apparently, studies that fail to show significant results have a hard time getting published.
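The mechanism is easy to simulate. In the sketch below (our illustrative numbers, not Gerber and Malhotra's data), every study tests a true null hypothesis, so z-scores follow a standard normal; if journals print everything above |z| = 1.96 but only one in ten results below it, the published record shows a glut just above the threshold and a dearth just below, even though the labs produced far more near-misses than hits:

```python
import random

random.seed(42)
THRESHOLD = 1.96

# Simulate 100,000 studies of a true null hypothesis.
z_scores = [abs(random.gauss(0, 1)) for _ in range(100_000)]

# Journals publish every "significant" result, but only 10% of the rest.
published = [z for z in z_scores
             if z > THRESHOLD or random.random() < 0.10]

# Compare the bins just below and just above the significance cutoff.
produced_below = sum(1 for z in z_scores if 1.5 <= z < THRESHOLD)
produced_above = sum(1 for z in z_scores if THRESHOLD <= z < 2.5)
just_below = sum(1 for z in published if 1.5 <= z < THRESHOLD)
just_above = sum(1 for z in published if THRESHOLD <= z < 2.5)

print(f"produced:  {produced_below} below vs {produced_above} above the cutoff")
print(f"published: {just_below} below vs {just_above} above the cutoff")
```

The ordering flips between the two printed lines: the world generates more near-misses than significant results, but the literature records the opposite.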

Major Crime Mapping Conference (2007) Call for Papers
Sep 20th, 2006 by Tom Johnson

Eight or nine years back we attended one of the first Crime Mapping conferences sponsored by the National Institute of Justice and found it to be one of the most creative and practical events of this type.  (We also have very high regard for the ESRI Users Conference and the Special Libraries Assoc. meetings.)  So we want to be sure to let all analytic journos know about next year's Crime Mapping confab, scheduled for March 28 to 31, 2007 in Pittsburgh, Pa.  Here's part of the official call for papers:

The Mapping & Analysis for Public Safety Program announces its Call
for Papers for the Ninth Crime Mapping Research Conference in Pittsburgh,
PA at the Omni William Penn Hotel, March 28 to 31, 2007. The deadline
for submission is Friday, September 29th....

The theme of this conference will be Spatial Approaches to
Understanding Crime & Demographics. Geographic Information Systems
(GIS) and spatial data analysis techniques have become prominent tools for
analyzing criminal behavior and the impacts of the criminal justice
system on society. Classical and spatial statistics have been merged to
form more comprehensive approaches to understanding social problems
from research and practical standpoints. These methods allow for the
measurement of proximity effects on places by neighboring areas, leading
to a multi-dimensional and less static understanding of the factors that
contribute to or repel crime across space.
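One concrete example of the classical-spatial merger the call describes is Moran's I, a standard statistic for whether high values cluster near other high values across neighboring areas. A bare-bones sketch in Python (our toy illustration, not conference material):

```python
def morans_i(values, weights):
    """Moran's I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where weights[i][j] is the spatial weight linking areas i and j
    and W is the sum of all weights."""
    n = len(values)
    m = sum(values) / n
    dev = [x - m for x in values]
    w_total = sum(weights[i][j] for i in range(n) for j in range(n))
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)

# Four areas along a line; adjacent areas get weight 1.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
clustered = morans_i([10, 9, 1, 2], w)     # high next to high: positive I
alternating = morans_i([10, 1, 10, 1], w)  # checkerboard: negative I
```

Positive I says crime (or any variable) pools in hot spots; negative I says neighbors repel; near zero says geography doesn't matter, which is the question the conference's spatial methods are built to answer.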

The 9th Crime Mapping Research Conference will be about demonstrating
the use and development of methodologies for practitioners and
researchers. The MAPS Program is anticipating the selection of key accepted
presentations for further development of an electronic monograph on GIS,
Spatial Data Analysis and the Study of Crime in the following year. Its
purpose will be to demonstrate the fusing of classical and spatial
analysis techniques to enhance policy decisions. Methods should not be
limited to the use of classical and spatial statistics but also
demonstrate the unique capabilities of GIS in preparing, categorizing and
visualization data for analysis....

Those were the days — the early days — of Social Network Analysis
Jul 28th, 2006 by Tom Johnson

Lest any of us think that Social Network Analysis is something new, please take the time to read this wonderful, albeit personal, history of the field.  Edward O. Laumann, of the University of Chicago, has been swimming in these waters for more than 40 years.  His address to the International Network for Social Network Analysis's 26th Annual Sunbelt Conference in Vancouver, Canada, in April 2006 tells much about how we have arrived at the current state of SNA.

See “A 45-Year Retrospective of Doing Networks”
