Socrata: Analyze all the datasets
May 25th, 2015 by Tom Johnson

Socrata is a growing outfit that manages databases for cities and non-profits.  This guy did a nice how-to page about his work with Socrata, and we thank him.

Tracking campaign contributions with MapLight
Jun 19th, 2014 by Tom Johnson

Maplight, a 501(c)(3) foundation, recently announced its “extensive mapping project examining the geographic origin of contributions to legislators by state; contributions from companies to legislators by state; and roll call votes by state and district on key bills in Congress.”

Today’s news peg points to “Who in Your State Has Contributed Money to Majority Leader Candidate Kevin McCarthy (R-CA)?”

MapLight looks to be a good addition to our GIS toolbox.

‘Try and find Narnia in the wardrobe’: inside the work of a research specialist
Jun 5th, 2014 by Tom Johnson

Thanks to Margo Williams for passing this interview along. It’s filled with important tips and insights gained from Myers’ years of experience. Read the full interview with Myers at

“Paul Myers is an internet research specialist working in the U.K. media. He joined the BBC in 1995 as a news information researcher. This followed an earlier career in computers and internet experience dating back to the 1970s.

“These days, his role sees him organise and deliver training courses related to internet investigation, digital security, social media research, data journalism, freedom of information and reporting statistics. His techniques have helped his colleagues develop creative approaches to research, conduct their investigations securely and have led many journalists to information they would never have otherwise been able to find. He has worked with leading British T.V. & radio news, current affairs, documentaries and consumer programmes.”

Three Tuesdays workshop on data and the political campaigns at the Santa Fe Complex
Sep 27th, 2008 by Tom Johnson

Handicapping the Horserace

Published by Don Begley at 10:09 pm under Complex News, event

    • September 30, 2008 – 6:30-8 pm
    • October 7, 2008 – 6:30-8 pm
    • October 14, 2008 – 6:30-8 pm

It’s human nature: Elections and disinformation go hand-in-hand. We idealize the competition of ideas and the process of debate while we listen to the whisper campaigns telling us of the skeletons in the other candidate’s closet. Or, we can learn from serious journalism to tap into the growing number of digital tools at hand and see what is really going on in this fall’s campaigns. Join journalist Tom Johnson for a three-part workshop at Santa Fe Complex to learn how you can be your own investigative reporter and get ready for that special Tuesday in November.

Over the course of three Tuesdays, beginning September 30, Johnson will show workshop participants how to do the online research needed to understand what’s happening in the fall political campaign. There will be homework assignments and participants will contribute to the Three Tuesdays wiki so their discoveries will be available to the general public.

Everyone is welcome but space will be limited. A suggested donation of $45 covers all three events or $20 will help produce each session. Click here to sign up.

  • The Daily Tip Sheet (September 30, 6:30 pm)

    Newspapers are a ‘morning line’ tip sheet. There isn’t enough room for what you need to know.

    Newspapers can be a good jumping-off point for political knowledge, but they rarely have enough staff, staff time and space to really drill down into a topic. Ergo, it is increasingly up to citizens to do the research to preserve democracy and help inform voters. Tonight we will be introduced to some of the city, state and national web sites to help in our reporting and to a few digital tools to help you save and retrieve what you find.
  • Swimming Against the Flow (October 7, 6:30 pm)

    How to track data to their upstream sources.

    A web page and its data are not static events. (Well, usually they are not.) Web pages and digital data all carry “signs” of where they came from, who owns the site(s) and sometimes who links to the sites. We will discuss how investigators can use these attributes to our advantage, and also take a step back to consider the “architecture of sophisticated web searching.”
  • The Payoff (October 14, 6:30 pm)

    Yup, it IS about following the money. But then what?

    Every election season, new web sites come along that make it easier to follow the money — election money. This final workshop looks at some of those sites and focuses on how to get their data into a spreadsheet. Then what? A short intro to slicing-and-dicing the numbers. (Even if you are a spreadsheet maven, please come and act as a coach.)
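The "then what?" step above can be sketched in a few lines. This is a minimal illustration, not any particular site's data: the column names and dollar figures below are invented, and stand in for whatever a campaign-finance site lets you download.

```python
# Slice-and-dice a contributions CSV: total giving by donor.
# All data here is invented for illustration.
import csv
import io
from collections import defaultdict

# Stand-in for a downloaded contributions file.
raw = """donor,candidate,amount
Acme Corp,Smith,500
Acme Corp,Jones,250
Jane Doe,Smith,100
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["donor"]] += float(row["amount"])

# Largest donors first.
for donor, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{donor}: ${total:,.2f}")
```

The same grouping-and-summing is what a spreadsheet pivot table does; the point of the session is that either tool gets you from a raw download to "who gave the most?" in minutes.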

This workshop is NOT a sit-and-take-it-in event. We’re looking for folks who want to do some beginning hands-on (“online hands-on,” that is) investigation of New Mexico politics. And that means homework assignments and contributing to our Three Tuesdays wiki. Participants are also encouraged to bring a laptop if they can. Click here to sign up.

Tom Johnson’s 30-year career path in journalism is one that regularly moved from the classroom to the newsroom and back. He worked for TIME magazine in El Salvador in the mid-80s, was the founding editor of MacWEEK, and a deputy editor of the St. Louis Post-Dispatch. His areas of interest are analytic journalism, dynamic simulation models of publishing systems, complexity theory, the application of Geographic Information Systems in journalism and the impact of the digital revolution on journalism and journalism education. He is the founder and co-director of the Institute for Analytic Journalism and a member of the Advisory Board of Santa Fe Complex.


If you're really serious about searching….
Dec 5th, 2007 by Tom Johnson

Deep Web Research 2008

Bots, Blogs and News Aggregators is a keynote presentation that I have been delivering over the last several years, and much of my information comes from the extensive research that I have completed over the years into the “invisible” or what I like to call the “deep” web. The Deep Web covers somewhere in the vicinity of 900 billion pages of information located throughout the World Wide Web in various files and formats that the current search engines on the Internet either cannot find or have difficulty accessing. Search engines currently locate approximately 20 billion pages.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the world wide web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps, and others. These files are predominately used by businesses to communicate information within their organization or to disseminate information to the external world. Searching for this information using deeper search techniques and the latest algorithms allows researchers to obtain a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained from these files by searching and accessing the “properties” information on these files.
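The "properties" metadata mentioned above is easy to see for yourself. A .docx file, for instance, is just a ZIP archive whose docProps/core.xml records the author and title. The sketch below (standard library only; the names in it are invented) builds a tiny archive in memory and reads those properties back out:

```python
# Sketch: read the "properties" metadata of an Office Open XML file.
# A .docx is a ZIP archive; creator/title live in docProps/core.xml.
# The document built here is a minimal stand-in for illustration.
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Jane Analyst</dc:creator>
  <dc:title>Quarterly Report</dc:title>
</cp:coreProperties>"""

def read_core_properties(docx_bytes: bytes) -> dict:
    """Return creator and title from an archive's docProps/core.xml."""
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=ns),
        "title": root.findtext("dc:title", namespaces=ns),
    }

# Build a minimal archive in memory, then inspect it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/core.xml", CORE_XML)

props = read_core_properties(buf.getvalue())
print(props["creator"], "|", props["title"])
```

On a real document pulled from a company website, the same fields can reveal an author name, an internal title, or editing history the publisher never meant to share.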

This article and guide is designed to give you the resources you need to better understand the history of deep web research, as well as various classified resources that allow you to search through the currently available web to find those key nuggets of information found only by understanding how to search the “deep web”.

This Deep Web Research 2008 article is divided into the following sections:



Tracking the bucks all the way to court
Oct 2nd, 2006 by JTJ

Another unique investigation by The New York Times gets A1 play in this Sunday's edition (1 Oct. 2006) under the hed “Campaign Cash Mirrors a High Court's Rulings.”  Adam Liptak and Janet Roberts (who probably did the heavy lifting on the data analysis) took a long-term look at who contributed to the campaigns of Ohio's Supreme Court justices.  It ain't a pretty picture if one believes the justices should be above lining their own pockets, whether it's a campaign fund or otherwise.

In any event, there seems to be a clear correlation between contributions — and their sources — and the outcomes of too many cases.  A sidebar, “Case Studies: West Virginia and Illinois,” would suggest there is much to be harvested by reporters in other states.

There is, thankfully, a fine description of how the data for the study was collected and analyzed.  See “How Information Was Collected.”

There are two accompanying infographics; one (“Ruling on Contributors’ Cases”) is much more informative than the other (“While the Case Is Being Heard, Money Rolls In”), which is a good but confusing attempt to illustrate difficult concepts and relationships.

At the end of the day, though, we are grateful for the investigation, data crunching and stories.

Ver 1.0 — The beat goes on
Apr 18th, 2006 by JTJ

We're pulling together the final pieces following the Ver 1.0 workshop in Santa Fe last week.  Twenty journalists, social scientists, computer scientists, educators, public administrators and GIS specialists met in Santa Fe April 9-12 to consider the question, “How can we verify data in public records databases?”

The papers, PowerPoint slides and some initial results of three breakout groups are now posted for the public on the Ver1point0 group site at Yahoo.  Check it out.

SJ Mercury-News Series: "Tainted Trials, Stolen Justice."
Jan 23rd, 2006 by JTJ

Friend-of-IAJ Griff Palmer alerts us to an impressive series this week that examines the conduct of the DA's office in Santa Clara County, California.  If nothing else, the series illustrates why good, vital-to-the-community journalism takes time and is expensive.  Rick Tulsky, Griff and other colleagues spent three years — not three days, but YEARS — on the story.  Griff writes:

I invite you all to take a look at “Tainted Trials, Stolen Justice.” This five-day series was three years in the making. It starts in today's Mercury News:

registration is required to view the Merc's content. I'm not sure yet if this URL will be cumulative or will only point to each day's part. If the latter, I'll work to get the entire package pulled together under one URL.

The Merc's on-line presentation includes a multimedia presentation, with Flash graphics, streaming audio and streaming video.

The project's backbone is reporter Rick Tulsky's review of every criminal appeal originating out of Santa Clara County Superior Court for five years. Rick was aided in his review by staff writers Julie Patel and Mike Zapler.

Rick has a law degree, and he used his legal training to analyze these cases for prosecutorial error, defense error and judicial error. He went over the cases with the Santa Clara County District Attorney's Office, defense attorneys and judges. He recruited seasoned criminal justice scholars and former judges and prosecutors to review his findings.

Rick's findings: Santa Clara County's criminal justice system, while far from broken, is systemically troubled by serious flaws that bias the system in prosecutors' favor and, in the worst cases, lead to outright miscarriages of justice. Rick found that more than a third of the 727 cases he analyzed were marred by some form of questionable conduct on the part of prosecutors, defense attorneys or judges. He found that California's Sixth Appellate District routinely found prosecutorial and judicial error to be harmless to criminal defendants — in dozens of instances, resorting to factual distortions and flawed reasoning to reach their conclusions.

This analysis has at least one serious limitation: It doesn't compare Rick's Santa Clara County findings with similar data from any other jurisdiction. It would frankly have been impossible, at least within three years, to conduct a similar case review on a broader scale.

To help us examine how Santa Clara County's criminal justice system differs from those of other counties, I captured 10 years' worth of felony arrest disposition data from the Criminal Justice Statistics Center, maintained by the California Attorney General's Office. I hand-keyed another four years' worth of CJSC data that were available only on paper. (I did a rough estimate at one point and determined that I'd keyed in somewhere in the neighborhood of 10,000 cells of data.)

This analysis showed us that, within the accuracy limitations of the CJSC data, Santa Clara County stood out for having one of the highest conviction rates and one of the lowest judicial dismissal rates among all counties with populations of 100,000 or more.

As Rick's attention turned to the appellate system, my attention was drawn to an interactive database system maintained by the California Administrative Office of the Courts:

I requested a copy of the underlying database from the AOC, only to be stonewalled. Months of effort on our attorneys' part yielded only one summary spreadsheet from the AOC.

Thanks to discussions on this list and at NICAR conferences, I knew it should be possible to programmatically retrieve the contents of the AOC database.  With Aron Pilhofer's and John Perry's Perl scripting tutorials, and with lots of generous coaching from John, I put together scripts that harvested the criminal appeals data from the AOC system and parsed it from HTML into delimited files.

That data retrieval underlies the numbers that appear in the final day of this series.
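The harvest-and-parse workflow Griff describes (his was in Perl) boils down to pulling HTML result pages and flattening their tables into delimited rows. A rough modern sketch, using only Python's standard library and an invented snippet of HTML in place of the real AOC markup, which the post does not show:

```python
# Sketch: parse an HTML results table into delimited (CSV) rows,
# the same shape of task as the AOC harvest described above.
# The HTML below is an invented stand-in for a real results page.
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect text from <td> cells, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

html = ("<table><tr><td>H123456</td><td>Affirmed</td></tr>"
        "<tr><td>H234567</td><td>Reversed</td></tr></table>")
parser = TableParser()
parser.feed(html)

out = io.StringIO()
csv.writer(out).writerows(parser.rows)
print(out.getvalue())
```

In practice the fetch step loops over the site's query pages; the parsing step stays this simple as long as the results render as plain tables.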

Resources related to Crime Mapping
Dec 7th, 2005 by Tom Johnson

I don't know if there has as yet been any empirical research done on how interested media consumers are in online crime mapping — or on how good the coverage is — but there is a body of literature debating readers' interest in crime per se.  It would seem to be a pretty good bet, though, that if people are interested in crime AND if more and more are going online via broadband, then some dynamic crime maps would get some hits.

Remember that crime mapping is not just about pushing digital push-pins on a map, GoogleMap or otherwise.  “Journey to Crime” maps, or maps showing where a car was stolen and when it was recovered, can provide interesting insights.

Here are some links recently posted to the CrimeMapping listserv that could be of value to journalists:

Journey-after-crime: How Far and to Which Direction DO They Go?

Linking Offender Residence Probability Surfaces to a Specific Incident Location

Journey to Crime Estimation

Applications for Examining the Journey-to-Crime Using Incident-Based Offender Residence Probability Surfaces

The Geography of Transit Crime


Yes, Virginia, methodology DOES matter
Nov 10th, 2005 by JTJ

A piece on calling the elections in Detroit:

MAKING A FORECAST: A secret formula helps producer call the election right



November 10, 2005

What was a viewer to believe?

As polls closed Tuesday, WDIV-TV (Channel 4) declared Freman Hendrix winner of Detroit's mayoral race by 10 percentage points.

WXYZ-TV (Channel 7) showed Hendrix ahead by 4 percentage points, statistically too close to call.

But WJBK-TV (Channel 2) got it right, declaring just after 9 p.m. that Mayor Kwame Kilpatrick was ahead, 52% to 48%, which turned out to be almost exactly the final 53%-47% outcome declared many hours later.

And it was vote analyst Tim Kiska who nailed it for WJBK, and for WWJ-AM radio, using counts from 28 of 620 Detroit precincts.

Kiska did it with help from Detroit City Clerk Jackie Currie. She allowed a crew that Kiska assembled to collect the precinct tallies shortly after the polls closed at 8 p.m.

Using what he calls a secret formula, Kiska calculated how those 28 precincts would predict the result citywide.

His formula also assumed that absentee voters chose Hendrix over Kilpatrick by a 2-1 ratio.
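The arithmetic behind such a projection can be sketched, though the weighting in Kiska's actual formula is secret and the vote counts below are invented: scale the sampled precincts up to the citywide election-day vote, then add absentees at the assumed split.

```python
# Hedged sketch of a sample-precinct projection. All tallies are
# invented; a real forecaster weights precincts by past behavior
# rather than scaling a simple average, as done here.
def project_citywide(sample_votes, total_precincts, absentee_total,
                     absentee_split=(2, 1)):
    """sample_votes: (candidate_a, candidate_b) tallies per counted
    precinct. Returns candidate A's projected share of the total."""
    n = len(sample_votes)
    avg_a = sum(a for a, _ in sample_votes) / n
    avg_b = sum(b for _, b in sample_votes) / n
    # Scale the per-precinct average to every precinct in the city.
    day_a, day_b = avg_a * total_precincts, avg_b * total_precincts
    # Add absentee ballots at the assumed ratio (2-1 in the article).
    wa, wb = absentee_split
    abs_a = absentee_total * wa / (wa + wb)
    abs_b = absentee_total * wb / (wa + wb)
    tot_a, tot_b = day_a + abs_a, day_b + abs_b
    return tot_a / (tot_a + tot_b)

share = project_citywide([(260, 240), (270, 230)],
                         total_precincts=620, absentee_total=40000)
print(f"{share:.1%}")
```

The fragility is visible in the code: get the absentee split or the representativeness of the sampled precincts wrong and the projection moves by whole points.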

That's different from the methods of pollsters who got it wrong Tuesday, Steve Mitchell for WDIV and EPIC/MRA's Ed Sarpolus for WXYZ and the Free Press. Both men used telephone polls, calling people at home during the day and evening and asking how they voted.

It's a more standard method of election-day polling, but Tuesday proved treacherous.

Kiska, a former reporter for the Free Press and Detroit News, has done such election-day predictions since 1974, but said he was nervous.

“Every time I go into one of these, my nightmare is I might get it wrong,” said Kiska, a WWJ producer. “I had a bad feeling about this going in. I thought there was going to be a Titanic hitting an iceberg and hoping it wouldn't be me.”

Kiska said he especially felt sorry for his friend Mitchell.

Mitchell said he's been one of the state's most accurate political pollsters over 20 years, but said his Tuesday survey of 800 voters turned out to be a bad sample.

He said polling is inherently risky, and that even well-conducted polls can be wrong one out of 20 times. “I hit number 20 this time.”

For Sarpolus, it's the second Detroit mayoral race that confounded his polls. He was the only major pollster in 2001 who indicated Gil Hill would defeat Kilpatrick.

Sarpolus said the pressure to get poll results on the air quickly made it impossible to adjust his results as real vote totals were made public during the late evening.

Of Kiska, Sarpolus said: “You have to give him credit. … But you have to assume all city clerks are willing to cooperate.”

Contact CHRIS CHRISTOFF at 517-372-8660 or
