SoCal fire maps
Oct 24th, 2007 by Tom Johnson

Today, literally hundreds of square kilometers of Southern California — from Los Angeles to San Diego — are burning. Some very alert newspapers and radio stations, though, are using Google Maps and a service called Twitter to update the maps on a regular basis. A good example, I think, of the applied tools of analytic journalism.

Southern California fires on Google Maps


More on Benford's Law
Jul 30th, 2007 by JTJ

We've long been intrigued with Benford's Law and its potential for Analytic Journalism.  Today we ran across a new post by Charley Kyd that both explains the Law and presents some clear formulas for its application.

An Excel 97-2003 Tutorial:

Use Benford's Law with Excel

To Improve Business Planning

Benford's Law addresses an amazing characteristic of data. Not only does his formula help to identify fraud, it could help you to improve your budgets and forecasts.

by Charley Kyd

July, 2007


(Follow this link for the Excel 2007 version.)

Unless you're a public accountant, you probably haven't experimented with Benford's Law.

Auditors sometimes use this fascinating statistical insight to uncover fraudulent accounting data. But it might reveal a useful strategy for investing in the stock market. And it might help you to improve the accuracy of your budgets and forecasts.

This article will explain Benford's Law, show you how to calculate it with Excel, and suggest ways that you could put it to good use.

From a hands-on-Excel point of view, the article describes new uses for the SUMPRODUCT function and discusses the use of local and global range names.  [Read more…]
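Kyd's tutorial works in Excel with SUMPRODUCT; for readers who prefer a scripting language, here is a minimal Python sketch of the same first-digit test. This is our own illustration, not code from Kyd's article: it computes the leading-digit frequencies Benford's Law predicts and the frequencies observed in a list of numbers, which you can then compare.

```python
import math
from collections import Counter

def benford_expected():
    # Benford's Law: P(d) = log10(1 + 1/d) for leading digits 1-9
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_freq(values):
    # Observed frequency of each leading digit among nonzero values;
    # leading zeros and the decimal point are stripped first.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    n = len(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}
```

Benford predicts a leading 1 about 30.1% of the time and a leading 9 only 4.6% of the time; a budget column whose observed frequencies stray far from `benford_expected()` is worth a second look, though deviation alone proves nothing.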



The Beauty of Statistics
Jul 11th, 2007 by JTJ

FYI: From the O'Reilly Radar

Unveiling the Beauty of Statistics

Posted: 11 Jul 2007 03:01 AM CDT

By Jesse Robbins

I presented last week at the OECD World Forum in Istanbul along with Professor Hans Rosling, Mike Arrington, John Gage and teams from MappingWorlds, Swivel (disclosure: I am an adviser to Swivel) and Many Eyes. We were the “Web2.0 Delegation” and it was an incredible experience.

The Istanbul Declaration signed at the conference calls for governments to make their statistical data freely available online as a “public good.” The declaration also calls for new measures of happiness and well-being, going beyond just economic output and GDP. This requires the creation of new tools, which the OECD envisions will be “wiki for progress.” Expect to hear more about these initiatives soon.

This data combined with new tools like Swivel and MappingWorlds is powerful. Previously this information was hard to acquire, and the tools to analyze it were expensive and hard to use, which limited its usefulness. Now, regular people can access, visualize and discuss this data, creating an environment where knowledge can be shared and explored.

H.G. Wells predicted that “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read or write.” Proponents of specific public policies often use statistics to support their view. They have the ability to select the data to fit with the policy. Democratization of statistics allows citizens to see the data that doesn't fit the policy, giving the public the power to challenge policymakers with new interpretations.

I highly recommend you watch Professor Rosling's exceptional summary of these exciting changes (where I got the title for this post), as well as his talks at TED.



The NYT DOES run a correction on its percentage screw-up
May 28th, 2007 by JTJ

So the NYT did backtrack on the percent-of-change error described yesterday, without assigning blame.  That's fine.  But the correction suggests another big story that we have only seen parts of.  That is: of the entire U.S. presence in Iraq — military and contractors — how many, and what proportion, are actually on the streets, and how many, and in what capacity, serve in support roles?

[New York Times] Corrections: For the Record
Published: May 28, 2007 [Monday]

A front-page headline on Saturday about concepts being developed by the Bush administration to reduce United States combat forces in Iraq by as much as half next year referred imprecisely to the overall effect on troop levels. As the story indicated, removing half of the 20 combat brigades now in Iraq by the end of 2008, one of the ideas under consideration, would cut the total number of troops there by about one-third, from 146,000 to roughly 100,000, not by 50 percent. That is because many of the troops that would remain in Iraq are in training or support units, not in combat forces. (Go to Article)

NYT needs to install a "math checker" on every copy editor's desk
May 27th, 2007 by JTJ

This weekend, friend-of-the-IAJ Joe Traub sent the following to the editor of the New York Times.  Here's the story Joe is talking about: “White House….”

To the Editor:

The headline on page 1 on May 26 states “White House Said to Debate '08 Cut in Troops by 50%.” The article reports a possible reduction to 100,000 troops from 146,000. That's 31.5%, not 50%. NPR's Morning Edition picked up the story from the NYT and also reported 50%.

Joseph F. Traub
The writer is a Professor of Computer Science at Columbia University

The headline error is bad enough (it's only in the hed, not in the story) — and should be a huge embarrassment to the NYT.  But the error gets compounded because, while the Times no longer sets the agenda for the national discussion, it is still thought of (by most?) as the paper of record.  Consequently, as other colleagues have pointed out, the reduction percentage gets picked up by other journalists who don't bother to do the math (or who cannot do the math).
See, for example:
* CBS News — “Troop Retreat In '08?” — (This video has a shot of the NYT story even though the percentage is not mentioned.  Could it be that the TV folks don't think viewers can do the arithmetic?)
(NB: We could not find on the NPR site the transcript of the radio story that picked up the 50 percent error.  But run a Google search with “cut in Troops by 50%” and note the huge number of bloggers who also went with the story without doing the math.)

Colleague Steve Doig has queried the reporter of the piece, David Sanger, asking if the mistake is that of the NYT or the White House.  No answer yet received, but Doig later commented: “Sanger's story did talk about reducing brigades from 20 to 10. That's how they'll justify the '50% reduction' headline, I guess, despite the clear reference higher up to cutting 146,000 troops to 100,000.”

Either way, it is a serious blunder of a fundamental sort on an issue most grave.  It should have been caught, but then most journalists are WORD people and only word people, we guess.
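A “math checker” need not be elaborate. As an illustration (ours, not anything the Times actually runs), a few lines of Python applied to the correction's own numbers would have flagged the headline before it shipped:

```python
def percent_change(old, new):
    """Percent change from an earlier value to a later one."""
    return (new - old) / old * 100

# The figures from the story: 146,000 troops reduced to 100,000.
drop = percent_change(146_000, 100_000)
print(f"{drop:.1f}%")  # prints -31.5%, nowhere near -50%
```

The 50% figure applies only to the 20 combat brigades being halved to 10; run the same function on the brigade counts and you get -50.0%, which is presumably where the headline writer's number came from.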

We would also point out the illogical construction that the NYT uses consistently in relaying statistical change over time.  To wit: “… could lower troop levels by the midst of the 2008 presidential election to roughly 100,000, from about 146,000…”  We wince. 

English is read from left to right.  Most English calendars and horizontal timelines are read from left to right.  When writing about statistical change, the same convention should be followed: the oldest dates and data precede the newest or future dates and data.  Therefore, this would best be written: “…could lower troop levels from about 146,000 to roughly 100,000 by the midst of the 2008 presidential election.”

Organizing the data; organizing the visualization
Jan 23rd, 2007 by JTJ

Thanks to our friend at the Universidad del Zulia in Maracaibo, Prof. Maria-Isabel Neuman, we just learned about this Rosetta Stone of data visualization.

This is a must-see:  “A Periodic Table of Visualization Methods.”

These guys in Switzerland at the Visual-Literacy Project have pulled together, in a wonderfully coherent fashion, the multiple concepts that many of us have been working on for years.

Be sure to also take a look at the paper by Lengler and Eppler at the bottom of the “Maps” page.  It's a good, tight explanation of what they are up to.  We like their definition:

“A visualization method is a systematic, rule-based, external, permanent, and graphic representation that depicts information in a way that is conducive to acquiring insights, developing an elaborate understanding, or communicating experiences.”

But we're not so sure that “permanent” is crucial or should even be included.  If they are referring to “method,” then that would seem to limit the opportunity for refinements over time.  And if they are talking about the resulting displays of data, might not that reduce the possibility of dynamic data displays, say real-time traffic flows or changes in the stock market?  Simulations?  Oh, well, a refinement ripe for discussion.

Hey, bunky, you say you need a story for tomorrow, and the well is dry
Jan 2nd, 2007 by JTJ

No story?  Then check out Swivel, a web site rich with data — and displays of data — that you didn't know about, and one pregnant with possibilities for a good news feature.  And often a news feature that could be localized.

Here, for example, is a posting from the SECRECY REPORT CARD 2005 illustrating the changing trends in the classification and de-classification of U.S. government data.  (You can probably guess the direction of the curves.)

Spotlight: What is the US Government Not Telling Us?

The number of classified documents is steadily increasing, while the number of pages being declassified is dwindling. These data were uploaded by mcroydon.

Yup, that time of the decade is again fast approaching
Oct 26th, 2006 by JTJ

FYI, folks:

Cynthia Taeuber will present her online course “Using the Census's American Community Survey (ACS)” Nov. 17 – Dec. 15. She will be available for questions and comments on a private discussion board throughout this period.

Prior to 2006, analysts had to make do with increasingly out-of-date detailed information about households and individuals while they waited for the next decennial census. Starting in 2006, this information will be made available on an annual basis in the ACS.

This course shows what sort of information is included, how to obtain it, and what methodological and sample-size issues present themselves.

If you have not made use of similar Census data previously, learn how you can leverage these improvements in data currency and timeliness for your projects. If you have used decennial census data before, you will benefit by learning about the methodological differences between this Survey and the decennial census long form – they affect the results, and you may make errors if you don't know how to handle the differences.

Ms. Taeuber, a senior policy advisor at the University of Baltimore's Jacob France Institute, has 30 years of experience at the U.S. Census Bureau, directed the analytic staff for the American Community Survey, and received the Commerce Dept.'s Gold Medal Award for her innovative work on the American Community Survey. She is the author of “The American Community Survey: Updated Information for America's Communities,” and more.

As with all online courses, there are no set hours when you must be online; we estimate you will need 7-15 hours per week.


Peter Bruce

P.S.  Also coming up:

Nov. 3 – Cluster Analysis (useful for customer segmentation)
Nov. 17 – How to deal with missing data
Nov. 27 – Basic Concepts in Probability and Statistics
612 N. Jackson St.
Arlington, VA 22201

Tracking the bucks all the way to court
Oct 2nd, 2006 by JTJ

Another unique investigation by The New York Times gets A1 play in this Sunday's edition (1 Oct. 2006) under the hed “Campaign Cash Mirrors a High Court's Rulings.”  Adam Liptak and Janet Roberts (who probably did the heavy lifting on the data analysis) took a long-term look at who contributed to the campaigns of Ohio's Supreme Court justices.  It ain't a pretty picture if one believes the justices should be above lining their own pockets, whether it's a campaign fund or otherwise.

In any event, there seems to be a clear correlation between contributions — and their sources — and the outcomes of too many cases.  A sidebar, “Case Studies: West Virginia and Illinois,” would suggest there is much to be harvested by reporters in other states.

There is, thankfully, a fine description of how the data for the study were collected and analyzed.  See “How Information Was Collected.”

There are two accompanying infographics; one (“Ruling on Contributors' Cases”) is much more informative than the other (“While the Case Is Being Heard, Money Rolls In”), which is a good but confusing attempt to illustrate difficult concepts and relationships.

At the end of the day, though, we are grateful for the investigation, data crunching and stories.

Statistically speaking….
Sep 20th, 2006 by Tom Johnson

Any discipline has its subsets of argument, typically about definitions, methodologies, process or significance.  Statistics, of course, is no different.  Below is an interesting article from the Washington Monthly about what constitutes statistical significance.  The article is OK, but the commentary below it is even better.

LIES, DAMN LIES, AND…. Via Kieran Healy, here's something way off the beaten path: a new paper by Alan Gerber and Neil Malhotra titled “Can political science literatures be believed? A study of publication bias in the APSR and the AJPS.” It is, at first glance, just what it says it is: a study of publication bias, the tendency of academic journals to publish studies that find positive results but not to publish studies that fail to find results. The reason this is a problem is that it makes positive results look more positive than they really are. If two researchers do a study, and one finds a significant result (say, tall people earn more money than short people) while the other finds nothing, seeing both studies will make you skeptical of the first paper's result. But if the only paper you see is the first one, you'll probably think there's something to it.

The chart on the right shows G&M's basic result. In statistics jargon, a significant result is anything with a “z-score” higher than 1.96, and if journals accepted articles based solely on the quality of the work, with no regard to z-scores, you'd expect the z-scores of published studies to resemble a bell curve. But that's not what Gerber and Malhotra found: just below a z-score of 1.96 there are far fewer studies than you'd expect. Apparently, studies that fail to show significant results have a hard time getting published.
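The missing-studies effect is easy to see in a toy simulation. The sketch below is our own illustration, not Gerber and Malhotra's method: it draws z-scores under a null hypothesis of no real effect, then “publishes” every significant result but only a fraction of the nulls, and the published distribution ends up wildly over-representing |z| ≥ 1.96.

```python
import random

def simulate_published_zscores(n_studies=10_000, accept_null_rate=0.1, seed=42):
    # Toy model: every study draws a z-score from a standard normal,
    # i.e. there are no true effects at all. Journals publish every
    # "significant" result (|z| >= 1.96) but only a fraction of the rest.
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        z = rng.gauss(0, 1)
        if abs(z) >= 1.96 or rng.random() < accept_null_rate:
            published.append(z)
    return published

published = simulate_published_zscores()
sig_share = sum(abs(z) >= 1.96 for z in published) / len(published)
```

Under the null, only about 5% of all studies should clear 1.96; in the published sample, the significant share is several times that, purely because the non-significant studies were filtered out. A histogram of `published` shows exactly the tell-tale dip just below 1.96 that the paper reports in real journals.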
