"Distributed data analysis"? Potentially.
Aug 3rd, 2009 by analyticjournalism

FYI from O'Reilly Radar

And does this suggest the possibility of something like “distributed data analysis,” whereby a number of widely scattered watchdogs could be poking into the same data set?  If so, it raises interesting questions for journalism educators: who is developing the tools to manage such investigations?

Enabling Massively Parallel Mathematics Collaboration — Jon Udell writes about Mike Adams whose WordPress plugin to grok LaTeX formatting of math has enabled a new scale of mathematics collaboration.


In February 2007, Mike Adams, who had recently joined Automattic, the company that makes WordPress, decided on a lark to endow all blogs running on with the ability to use LaTeX, the venerable mathematical typesetting language. So I can write this:

$latex \pi r^2$

And produce this:

πr²

When he introduced the feature, Mike wrote:

Odd as it may sound, I miss all the equations from my days in grad school, so I decided that what needed most was a hot, niche feature that maybe 17 people would use regularly.

A whole lot more than 17 people cared. And some of them, it turns out, are Fields medalists. Back in January, one member of that elite group — Tim Gowers — asked: Is massively collaborative mathematics possible? Since then, as reported by observer/participant Michael Nielsen (1, 2), Tim Gowers, Terence Tao, and a bunch of their peers have been pioneering a massively collaborative approach to solving hard mathematical problems.

Reflecting on the outcome of the first polymath experiment, Michael Nielsen wrote:

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

Just this week, a polymath blog has emerged to serve as an online home for the further evolution of this approach.


"The Devil is in the Digits"? No, I'd say they abound in the comments.
Jun 23rd, 2009 by analyticjournalism

An intriguing op-ed in The Washington Post on Saturday (June 20, 2009) claimed to spot fraud in the Iran elections by applying some analytic methods basically drawn from Benford's Law.  Yes, read the article, but be sure to drill down into the 140+ comments.  Most are quite cogent and well argued.

The Devil Is in the Digits

Since the declaration of Mahmoud Ahmadinejad's landslide victory in Iran's presidential election, accusations of fraud have swelled. Against expectations from pollsters and pundits alike, Ahmadinejad did surprisingly well in urban areas, including Tehran — where he is thought to be highly unpopu…

By Bernd Beber and Alexandra Scacco
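
For readers who want to try a digit test themselves, here is a minimal sketch of a Benford-style first-digit check. It is not the op-ed authors' procedure, which is not reproduced here. The function names and the sample counts are made up for illustration. Benford's Law says a leading digit d turns up with probability log10(1 + 1/d).

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_counts(values):
    """Count the leading digits of the non-zero integers in `values`."""
    counts = Counter(int(str(abs(v))[0]) for v in values if v != 0)
    return {d: counts.get(d, 0) for d in range(1, 10)}


def chi_square(observed, expected_share):
    """Pearson chi-square statistic: observed counts vs. expected shares."""
    n = sum(observed.values())
    if n == 0:
        raise ValueError("no non-zero values to test")
    return sum(
        (observed[d] - n * p) ** 2 / (n * p) for d, p in expected_share.items()
    )


# Hypothetical vote counts, invented purely for illustration.
sample_counts = [1203, 187, 2950, 114, 1678, 342, 8021, 1530, 96, 2214]
observed = first_digit_counts(sample_counts)
stat = chi_square(observed, benford_expected())
print(observed)
print(round(stat, 2))
```

A large statistic suggests the digits depart from Benford's pattern. With only a handful of numbers, or with data that need not follow Benford in the first place, that is a prompt for more reporting rather than proof of fraud.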


Rise of the Data Scientist
Jun 4th, 2009 by analyticjournalism

Nathan, the chap who curates the valuable blog Flowing Data, offers up a bit of hope for journalists who are worried about their employment futures and yet have invested in learning methods of data analysis.  When thinking about re-inventing ourselves, consider the phrase “data scientist.”

Rise of the Data Scientist

Posted by Nathan / Jun 4, 2009 to Data Design Tips, Statistics / 6 comments

Photo by majamarko

As we've all read by now, Google's chief economist Hal Varian commented in January that the next sexy job in the next 10 years would be statisticians. Obviously, I whole-heartedly agree. Heck, I'd go a step further and say they're sexy now – mentally and physically.

However, if you went on to read the rest of Varian's interview, you'd know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts.

Sexy Skills of Data Geeks

As a follow up to Varian's now-popular quote among data fans, Michael Driscoll of Dataspora discusses the three sexy skills of data geeks. I won't rehash the post, but here are the three skills that Michael highlights:

  1. Statistics – traditional analysis you're used to thinking about
  2. Data Munging – parsing, scraping, and formatting data
  3. Visualization – graphs, tools, etc.

Oh, but there's more…

These skills actually fit tightly with Ben Fry's dissertation on Computational Information Design (2004). However, Fry takes it a step further and argues for an entirely new field that combines the skills and talents from often disjoint areas of expertise:

  1. Computer Science – acquire and parse data
  2. Mathematics, Statistics, & Data Mining – filter and mine
  3. Graphic Design – represent and refine
  4. Infovis and Human-Computer Interaction (HCI) – interaction

And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We're seeing data scientists – people who can do it all – emerge from the rest of the pack.

Advantages of the Data Scientist

Think about all the visualization stuff you've been most impressed with or the groups that always seem to put out the best work. Martin Wattenberg. Stamen Design. Jonathan Harris. Golan Levin. Sep Kamvar. Why is their work always of such high quality? Because they're not just students of computer science, math, statistics, or graphic design.

They have a combination of skills that not just makes independent work easier and quicker; it makes collaboration more exciting and opens up possibilities in what can be done. Oftentimes, visualization projects are disjoint processes and involve a lot of waiting. Maybe a statistician is waiting for data from a computer scientist; or a graphic designer is waiting for results from an analyst; or an HCI specialist is waiting for layouts from a graphic designer.

Let's say you have several data scientists working together though. There's going to be less waiting and the communication gaps between the fields are tightened.

How often have we seen a visualization tool that held an excellent concept and looked great on paper but lacked the touch of HCI, which made it hard to use and in turn no one gave it a chance? How many important (and interesting) analyses have we missed because certain ideas could not be communicated clearly? The data scientist can solve your troubles.

An Application

This need for data scientists is quite evident in business applications where educated decisions need to be made swiftly. A delayed decision could mean lost opportunity and profit. Terabytes of data are coming in whether it be from websites or from sales across the country, but in an area where Excel is the tool of choice (or force), there are limitations, hence all the tools, applications, and consultancies to help out. This of course applies to areas outside of business as well.

Learn and Prosper

Even if you're not into visualization, you're going to need at least a subset of the skills that Fry highlights if you want to seriously mess with data. Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.

Basically, the more you learn, the more you can do, and the higher in demand you will be as the amount of data grows and the more people want to make use of it.


"Interaction Design Pilot Year Churns Out Great Student Projects"
May 9th, 2009 by analyticjournalism

Another interesting post from “FlowingData

Interaction Design Pilot Year Churns Out Great Student Projects

In a collaborative initiative between Copenhagen Institute of Interaction Design and The Danish Design School, the Interaction Design Pilot Year brings together students and faculty from various disciplines for a unique brand of education.

The Interaction Design Pilot Year is a full-time intense curriculum that includes a number of skills-based modules (such as video prototyping and computational design), followed by in-depth investigations into graphical/tangible interfaces and service design.

The end result? A lot of great work from some talented and motivated students. There are a number of project modules, but naturally, I'm most drawn to the interactive data visualization projects. Here are a few of the projects.

Find much more in the student gallery.”



Craig's List had NOTHING to do with a decline in classified ad revenue.
Nov 14th, 2008 by analyticjournalism

IAJ co-founder Steve Ross has long argued that Craig's List HAS NOT contributed in a major way to the decline of North American newspaper advertising revenues.  Here's his latest analysis:

This isn't rocket science. Craig's List had NOTHING to do with a supposed decline in classified ad revenue.

Here's the raw PRINT classified revenue data, right off the NAA website. (If anyone doesn't use Excel 2007 I can send the data file in another format, but everyone should be able to read the chart as a jpg).
Click here for bar chart

Note that the big change that pushed classified ad volume up in the 90s was employment advertising. Damn right. The country added 30 million new jobs in that period, and the number of new people entering the workforce declined because births had declined in the mid-1970s. More competition for bodies = more advertising needed.

Knock out the employment data and everything else stayed steady or INCREASED for newspaper classified.

The past 7 years were not as good for employment ads, but still better than in pre-web days.

There was indeed sharp deterioration in 2007 (and of course, 2001), as the economy soured.

There are some missing data (idiots) right around the time the web came in — 1993-4.

But just look at 1994-2006 — the “web years.” Total print classified ad dollar volume was $12.5 billion in 1994, $17 billion in 2006, roughly in line with inflation AT A TIME WHEN CIRCULATION FELL and even newspapers managed to get some new online revenue!!!
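
The “roughly in line with inflation” claim is easy to check with a little arithmetic. The sketch below takes only the two dollar figures quoted above and works out the implied average annual growth rate. Comparing that rate with an official inflation series is left to the reader, since no inflation figure is given here. The function name is just an illustrative choice.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Print classified revenue in billions of dollars, as quoted in the post.
revenue_1994 = 12.5
revenue_2006 = 17.0

total_change = revenue_2006 / revenue_1994 - 1
rate = annual_growth_rate(revenue_1994, revenue_2006, 2006 - 1994)

print(f"Total change 1994-2006: {total_change:.1%}")
print(f"Average annual growth:  {rate:.1%}")
```

The total rise is 36 percent, which works out to about 2.6 percent a year compounded over the twelve years.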

Look, I can do this with ad linage (which didn't rise much at all but stayed ahead of circ declines), I can compare with display ad figures, I can do Craig's List cities vs non-Craig, I can add back the web revenue because in fact newspapers allocate revenue wrong, to preserve existing ad sales commission schemes, and thus undercount web revenue. I can do ad revenue per subscriber. And on and on.

All those corrections make this look even better for newspapers.

This is SO OBVIOUS that I just do not understand the “Craig's List has killed us” argument or even the “web killed us” argument.

It is (to me, anyway) a transparent lie. Either the newspaper barons are so inanely stupid that they don't understand their own business, or they are incompetent managers, looking for an excuse. Maybe both.

But oddly enough, Craig Newmark believes he did the damage. I've been on several panels with him where he has apologized for killing newspapers.

I might also add that some obviously web-literate societies are seeing a newspaper boom. Germany is an example.


Three Tuesdays workshop on data and the political campaigns at the Santa Fe Complex
Sep 27th, 2008 by Tom Johnson

Handicapping the Horserace

Published by Don Begley at 10:09 pm under Complex News, event

  • September 30, 2008 – 6:30-8 pm
  • October 7, 2008 – 6:30-8 pm
  • October 14, 2008 – 6:30-8 pm

It’s human nature: Elections and disinformation go hand-in-hand. We idealize the competition of ideas and the process of debate while we listen to the whisper campaigns telling us of the skeletons in the other candidate’s closet. Or, we can learn from serious journalism to tap into the growing number of digital tools at hand and see what is really going on in this fall’s campaigns. Join journalist Tom Johnson for a three-part workshop at Santa Fe Complex to learn how you can be your own investigative reporter and get ready for that special Tuesday in November.

Over the course of three Tuesdays, beginning September 30, Johnson will show workshop participants how to do the online research needed to understand what’s happening in the fall political campaign. There will be homework assignments and participants will contribute to the Three Tuesdays wiki so their discoveries will be available to the general public.

Everyone is welcome but space will be limited. A suggested donation of $45 covers all three events or $20 will help produce each session. Click here to sign up.

  • The Daily Tip Sheet (September 30, 6:30 pm)

    Newspapers are a ‘morning line’ tip sheet. There isn’t enough room for what you need to know.

    Newspapers can be a good jumping-off point for political knowledge, but they rarely have enough staff, staff time and space to really drill down into a topic. Ergo, it is increasingly up to citizens to do the research to preserve democracy and help inform voters. Tonight we will be introduced to some of the city, state and national web sites to help in our reporting and to a few digital tools to help you save and retrieve what you find.
  • Swimming Against the Flow (October 7, 6:30 pm):

    How to track data to their upstream sources.

    A web page and its data are not static events. (Well, usually they are not.) Web pages and digital data all carry “signs” of where they came from, who owns the site(s) and sometimes who links to the sites. We will discuss how investigators can use these attributes to our advantage, and also take a step back to consider the “architecture of sophisticated web searching.”
  • The Payoff (October 14, 6:30 pm)

    Yup, it IS about following the money. But then what?

    Every election season, new web sites come along that make it easier to follow the money — election money. This final workshop looks at some of those sites and focuses on how to get their data into a spreadsheet. Then what? A short intro to slicing-and-dicing the numbers. (Even if you are a spreadsheet maven, please come and act as a coach.)
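
As a taste of the “then what?” step in the final session, here is a minimal sketch of slicing a contributions file once it has been exported to CSV. The column names (`candidate`, `donor_city`, `amount`) and the sample rows are hypothetical. Real campaign-finance downloads use their own layouts.

```python
import csv
import io
from collections import defaultdict


def totals_by_candidate(csv_text):
    """Sum the `amount` column for each value in the `candidate` column."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["candidate"]] += float(row["amount"])
    # Largest totals first, as a descending spreadsheet sort would show them.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows standing in for a downloaded contributions file.
sample = """candidate,donor_city,amount
Candidate A,Santa Fe,250.00
Candidate B,Albuquerque,100.00
Candidate A,Taos,75.50
Candidate B,Santa Fe,500.00
"""

for name, total in totals_by_candidate(sample):
    print(f"{name}: ${total:,.2f}")
```

The same grouping can be done with a pivot table in a spreadsheet. The point is only that “following the money” usually starts with a sum per recipient.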

This workshop is NOT a sit-and-take-it-in event. We’re looking for folks who want to do some beginning hands-on (“on-line hands-on,” that is) investigation of New Mexico politics. And that means homework assignments and contributing to our Three Tuesdays wiki. Participants are also encouraged to bring a laptop if they can. Click here to sign up.

Tom Johnson’s 30-year career path in journalism is one that regularly moved from the classroom to the newsroom and back. He worked for TIME magazine in El Salvador in the mid-80s, was the founding editor of MacWEEK, and a deputy editor of the St. Louis Post-Dispatch. His areas of interest are analytic journalism, dynamic simulation models of publishing systems, complexity theory, the application of Geographic Information Systems in journalism and the impact of the digital revolution on journalism and journalism education. He is the founder and co-director of the Institute for Analytic Journalism and a member of the Advisory Board of Santa Fe Complex.


A bit of creative Analytic Journalism, Oprah-wise
Aug 11th, 2008 by Tom Johnson

The NYTimes moves an interesting short today describing how a couple of economists did some creative analysis suggesting that Oprah was worth a million-plus primary votes for Obama.

Endorsement From Winfrey Quantified: A Million Votes

Published: August 10, 2008

Presidential candidates make the most of celebrity supporters, showing them off in television ads and propping them on podiums to stand and wave. No doubt Mike Huckabee’s aborted campaign for the Republican nomination got some sort of bump from those commercials of him with Chuck Norris, right?

Or maybe not. Politicians and pundits routinely claim that celebrity endorsements have little sway on voters, and two economists set out recently to test the premise. What they found was that at least one celebrity does hold influence in the voting booth: Oprah Winfrey.

The economists, Craig Garthwaite and Timothy Moore of the University of Maryland, College Park, contend that Ms. Winfrey’s endorsement of Barack Obama last year gave him a boost of about one million votes in the primaries and caucuses. Their conclusions were based partly on a county-by-county analysis of subscriptions to O: The Oprah Magazine and sales figures for books that were included in her book club.

Those data points were cross-referenced with the votes cast for Mr. Obama in various polling precincts. The results showed a correlation between magazine sales and the vote share obtained by Mr. Obama, and extrapolated an effect of 1,015,559 votes.
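
To make “a correlation between magazine sales and the vote share” concrete, here is a minimal sketch of a Pearson correlation across counties. It is not the economists' model, which the article does not describe in detail. The county figures below are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, 2 or more items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical county data: magazine subscriptions per 1,000 residents
# and a candidate's vote share in percent.
subscriptions = [4.1, 6.3, 2.8, 7.9, 5.0]
vote_share = [41.0, 48.5, 39.2, 55.1, 46.0]

print(round(pearson_r(subscriptions, vote_share), 3))
```

A coefficient near 1 only says the two series move together. Turning that into an estimate of votes gained takes further modeling and assumptions.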

“We think people take political information from all sorts of sources in their daily life,” Mr. Moore said in an e-mail message, “and for some people Oprah is clearly one of them.”

In their as-yet-unpublished research paper on the topic, the economists trace celebrity endorsements back to the 1920 campaign of Warren Harding (who had Al Jolson, Lillian Russell and Douglas Fairbanks in his corner), and call Ms. Winfrey “a celebrity of nearly unparalleled influence.”

The economists did not, however, look at how Ms. Winfrey’s endorsement of Mr. Obama may have affected her own popularity. A number of people — women in particular — were angry that Ms. Winfrey threw her first-ever political endorsement to a man rather than his female opponent.

The research did not try to measure the influence of other stars’ endorsements; for instance, no similar measures were available for Obama supporters like the actress Jessica Alba or Pete Wentz of the band Fall Out Boy. “If a celebrity endorsement is ever going to have an empirically identifiable audience, then it is likely to be hers,” the researchers said of Ms. Winfrey. Sorry, Chuck Norris.

More good work out of the UC Berkeley Viz Lab
Jul 31st, 2008 by Tom Johnson

A helpful post from Nathan at FlowingData

New Version of Flare Visualization Toolkit Released

Posted Jul 31, 2008 to Software, Visualization by Nathan
3 responses


A new version of Flare, the data visualization toolkit for Actionscript (which means it runs in Flash), was just released yesterday with a number of major improvements from the previous version. The toolkit was created and is maintained by the UC Berkeley Visualization Lab and was one of the first bits of Actionscript that I got my hands on. The effort-to-output ratio was pretty satisfying, so if you want to learn Actionscript for data visualization, check out Flare. The tutorial is a good place to start.

Here are some sample applications created with Flare:

Direct to the Dashboard
Jan 28th, 2008 by JTJ

We've been fans of the dashboard approach for a long time because dashboard graphics can give readers a quick snapshot of multiple sets of dynamic data.  Charley Kyd, who studied journalism some years back, has developed a nifty plug-and-play package — Dashboard Kit #1 — to generate these.  And below is a recent and relevant posting from Jorge Camoes that gives us some good tips on the topic.


10 tips to improve your Excel dashboard

Posted: 26 Jan 2008 06:42 PM CST

Posts in the series Excel Dashboard

  1. How to create a dashboard in Excel
  2. 10 tips to improve your Excel dashboard

Excel is a great (but underrated) BI tool. Several BI vendors gave up fighting it and offer Excel add-ins as front-ends for their BI solutions. So, if you want to create a dashboard you should consider Excel, since it really offers better functionalities than many other applications for a fraction of the cost and development time. I know that Excel is not a one-size-fits-all solution, but first you should be sure that your requirements are not met by Excel. Let me share with you some random tips from my experience with the Demographic Dashboard.

But, shouldn’t I just ask my IT to create the dashboard?

This is a fact: many IT departments hate Excel. IT spends millions on BI solutions and users keep using Excel. Why? Because they know it, they like it, they feel in control and can do whatever they want with the data. Ask your BI manager to replicate the image above using an expensive BI solution and he’ll come back six months later with something you didn’t ask for, to answer a need you don’t have anymore (I know, I’m oversimplifying…). Do you know Master Foo Defines Enterprise Data?

1. Go to the point, solve a business need

So, you have your idea for a dashboard, you’ve discussed the project with the users (right?) and you are ready. But where to start? Remember this: a graph, a table, the entire dashboard, are merely instrumental to solve a business need. It’s about insights, not about data, not about design.

2. Don’t use formulas

Yes, I know, this is Excel, and it is supposed to have formulas. What I am telling you is that you should aim at minimizing the number of independent formulas, and this should be a fundamental constraint to your global strategy. Too often I see Excel used as a database application. It is not, it is a spreadsheet (not everyone finds this obvious).

Over the years I had my share of “spreadsheet hell”: a lookup formula in the middle of nowhere would reference a wrong range for no apparent reason. An update cycle adds a new column and suddenly there are errors all over the place. You leave the project for a week and when you come back you don’t know what all those formulas mean. Even if everything goes smoothly the auditing dep wants to trace every single result.

But how do you minimize the use of formulas? If your data table resides in an Excel sheet you’ll have to rely heavily on lookup formulas, and that’s one of the highways to spreadsheet hell. Instead, get the data from an external source (Access, OLAP cube…) and bring data into Excel. Calculations should be performed at the source. After removing all the formulas you can, the remaining ones should be as clear as possible.

3. Abuse Pivot Tables

Every object (graph, table) in the Demographic Dashboard is linked to a pivot table. Let me give you an example. One of the charts shows population growth over the years, using 1996 as reference. Pivot tables can calculate that directly, so I don’t need to add a new layer of complexity by using formulas (to calculate the actual values, plus lookup formulas to get them).
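
The growth-relative-to-1996 calculation that the pivot table performs can be sketched outside Excel as well. This is only an illustration of the arithmetic (sum by year, then divide by the reference year). The records and field layout are hypothetical, not the Demographic Dashboard's data.

```python
from collections import defaultdict


def growth_index(records, base_year):
    """Total population per year, expressed as an index with base_year = 100."""
    totals = defaultdict(int)
    for year, _region, population in records:
        totals[year] += population
    base = totals.get(base_year, 0)
    if base == 0:
        raise ValueError(f"no population recorded for base year {base_year}")
    return {year: 100 * total / base for year, total in sorted(totals.items())}


# Hypothetical (year, region, population) rows.
records = [
    (1996, "North", 1000), (1996, "South", 1500),
    (2001, "North", 1100), (2001, "South", 1525),
    (2006, "North", 1150), (2006, "South", 1600),
]

print(growth_index(records, base_year=1996))
# {1996: 100.0, 2001: 105.0, 2006: 110.0}
```

Keeping the raw rows in one place and deriving the index from them mirrors the tip: the summary is recomputed from the source rather than maintained cell by cell.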

The population table has 200,000 records, so I couldn’t fit it into the Excel limit of 65 thousand rows (yes, that’s changed in Excel 2007, but it is debatable if a table with a million rows in a spreadsheet application can be considered good practice). By using a pivot table I can overcome that limit.

4. Use named ranges

To be able to use self-documenting formulas (“=sales-costs” is much simpler to understand than “=$D$42-$F$55”) is one of several uses of named ranges. But they are also the building blocks of interaction with the user and they make your Excel dashboard more robust.

5. Use as many sheets as you need, or more

You don’t have to pay for each additional sheet you use in a workbook, so use as many as you need. Each cell in your dashboard report sheet should point to some other sheet where you actually perform the calculations. You should have at least three groups of sheets: a sheet with the dashboard report itself, sheets with the base data and another group with supporting data, definitions, parameters, etc. Also add a glossary sheet and a help sheet.

6. Use autoshapes as placeholders

Once you know what you need, start playing with the dashboard sheet. Use autoshapes to test alternative layouts or, better yet, use real objects (charts, tables…) linked to some dummy data.

7. Get rid of junk

There are two ways to wow your users: by designing a dashboard that actually answers needs, or by planting gauges and pie charts all over the place (this one can guarantee you a promotion in some dubious workplaces, but it will not help you in the long run). In the series on Xcelsius Dashboards you can see how difficult it is to create something beyond the most basic and irrelevant charts.

So, get rid of Excel defaults (take a look at this before/after example) and just try to make your dashboard as clean and clear as possible. You’ll find many tips around here to improve your charts, so I’ll not repeat myself.

8. Do you really need that extra-large chart?

Charts are usually larger than they should be. What really matters in a chart is the pattern, not the individual values, and that can be seen even with a very small chart.

9. Implement some level of interaction

A dashboard is not an exploratory tool; it is something that should give you a clear picture of what is going on. But I believe that at least a basic level of interaction should be provided. Users like to play with the tools and they can learn a lot more that way than by just looking at some static image.

10. Document your work

Please, please, structure and document your workbook. Excel is a very flexible environment, but with flexibility comes responsibility… I am not a very organized person myself, but from time to time I try the tourist point of view: I pretend I never saw that file in my life and I try to understand it. If I can’t, or it takes me too long, either I must redesign it or write a document that explains the basic structure and flow.

Bonus tip: there is always something missing…

Once you have a prototype, users will come up with new ideas. Some of them can be implemented, others will ruin your project and if you accept them you’ll have to restart from scratch. So, make sure the specifications are understood and approved and the consequences of a radical change are clear.

This is far too incomplete, but I’ll try to improve it. Will you help? Do you have good tips specific to the design of Excel dashboards? Please share them in the comments.


The DataWeb and the DataFerrett
Jan 3rd, 2008 by Tom Johnson

Marylaine Block's always informative “Neat New Stuff” [Neat New Stuff I Found This Week at] tipped us to TheDataWeb site and its interesting tool, the DataFerrett.

“TheDataWeb is a network of online data libraries that the DataFerrett application accesses the data through. Data topics include, census data, economic data, health data, income and unemployment data, population data, labor data, cancer data, crime and transportation data, family dynamics, vital statistics data, . . . As a user, you have an easy access to all these kinds of data. As a participant in TheDataWeb, you can publish your data to TheDataWeb and, in turn, benefit as a provider to the consumer of data.”

What is the DataFerrett?
DataFerrett is a unique data mining and extraction tool. DataFerrett allows you to select a databasket full of variables and then recode those variables as you need. You can then develop and customize tables. Selecting your results in your table you can create a chart or graph for a visual presentation into an html page. Save your data in the databasket and save your table for continued reuse. DataFerrett helps you locate and retrieve the data you need across the Internet to your desktop or system, regardless of where the data resides. DataFerrett:
* lets you receive data in the form in which you need it (whether it be extracted to an ascii, SAS, SPSS, Excel/Access file); or
* lets you move seamlessly between query, analysis, and visualization of data in one package;
* lets data providers share their data easier, and manage their own online data.
DataFerrett runs from the application icon installed on your desktop.

Check it out at

