Mudwrestling Slippery Data: Google Insights

Today Andrew Chen posted about his experiments using Google Insights for Search to draw conclusions about the audiences of a variety of websites.

He looked at MySpace, Twitter, Digg, Facebook and a few others. He used the geographic distribution of people searching for the name of each site as a proxy. If a site’s searchers were concentrated in California, he concluded that it had only caught on with the Silicon Valley / Early Adopter crowd. Other sites, like MySpace, showed fairly uniform distribution across the US. Facebook had broad distribution, but skewed towards the east coast. Perhaps most interestingly, Twitter looked like it was on its way to wider adoption.
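To make "concentrated" versus "uniform" a bit more concrete, here is a minimal sketch of two simple concentration measures. It assumes you have copied per-state interest scores out of Insights by hand into a dictionary; the function names and the numbers are made up for illustration and are not Andrew's method or real data.

```python
def top_share(scores):
    """Fraction of the total interest score held by the highest-scoring region."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no interest recorded in any region")
    return max(scores.values()) / total


def herfindahl(scores):
    """Sum of squared shares: 1/n for a perfectly even spread, 1.0 for a single region."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no interest recorded in any region")
    return sum((value / total) ** 2 for value in scores.values())


# Hypothetical scores, NOT real Google Insights output.
concentrated = {"CA": 100, "NY": 10, "WA": 10, "TX": 5}
uniform = {"CA": 100, "NY": 100, "WA": 100, "TX": 100}

print(top_share(concentrated), herfindahl(concentrated))
print(top_share(uniform), herfindahl(uniform))
```

A site whose top state holds most of the total score would count as "early adopter only" under this reading, while an even spread pushes both numbers toward 1/n. Where to draw the line between the two is a judgment call.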

Andrew also looked at the distribution of people searching for TechCrunch. He found that almost all of them were in California. This attracted the attention of Erick Schonfeld, a writer for TechCrunch. He was surprised by Andrew’s results and decided to dig deeper. He noticed that Andrew had searched for the full domain name (i.e. Techcrunch.com), rather than just the site name (i.e. Techcrunch). When Erick searched for just “Techcrunch” he saw a much broader distribution. It was still heaviest in California, but tech and media centers like WA, TX, MA & NY also had a strong showing.

I’d been meaning to check out Google Insights for myself, and these two posts gave me the impetus to finally do something about it. I majored in biology in college, so I was taught to approach any data, and the conclusions drawn from it, with a skeptical eye. It’s taken me a long time to get used to the sloppiness with which numbers are used in a business setting. Whenever I see a graph, I want to see the error bars. But I’ve learned to live without them. You have to, when even commercial data sets that cost hundreds or thousands to access have all sorts of quirks and caveats attached to them.

So, I look for possible biases in the underlying data. The most obvious question is how closely searching for a site correlates with actually using the site. This is a question I can’t answer with much precision, especially since I don’t (yet) have access to the traffic stats of a reasonably popular site to calibrate against. But I do know that a lot of people search for sites they regularly use rather than using a bookmark or typing in the URL directly. It may actually require fewer keystrokes, fewer clicks, and less cognitive load. So I’ll accept the idea that it’s a reasonable proxy.
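If you did have traffic stats to calibrate against, one rough check would be a plain Pearson correlation between search interest and visits over the same periods. The sketch below assumes two equal-length lists of numbers collected by hand; the function name and inputs are hypothetical, and a high correlation would only suggest, not prove, that search interest tracks usage.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return covariance / (spread_x * spread_y)


# Placeholder inputs, NOT real measurements: monthly search interest vs. monthly visits.
search_interest = [1, 2, 3, 4]
site_visits = [10, 20, 30, 40]
print(pearson(search_interest, site_visits))
```

With only a handful of monthly points, the result is very noisy, so treat it as a sanity check rather than a measurement.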

Still, I wondered whether search volume might be a better leading indicator of interest than an indicator of actual usage. Again, this isn’t something I’m going to be able to nail down, but it did lead me to discover that it is possible to slice the Google data month by month. I’ll show the results below for Twitter.com.

I wanted to start before Twitter’s launch in July 2006, so I could get a baseline, but the baseline was effectively 0: Google gives an error that there isn’t enough data on the volume of searches for “twitter” until you get to January 2007.

January ’07

April ’07

July ’07

October ’07

January ’08

April ’08

July ’08

Being able to look at a time series of data is awesome, because you can start to eyeball relative trends and not worry so much about the inaccuracies of absolute measurements (assuming the methodology for data collection remains pretty constant from period to period).
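Here is a small sketch of why relative trends are more forgiving. If the measurement is off by a constant factor in every period (which is an assumption about the data source, not something I can verify for Google Insights), dividing each period by the first one cancels that factor out. The function name and numbers are invented for illustration.

```python
def index_to_baseline(series):
    """Express each value as a multiple of the first (baseline) value."""
    baseline = series[0]
    if baseline == 0:
        raise ValueError("baseline period must be non-zero")
    return [value / baseline for value in series]


# Made-up "true" volumes, and the same volumes seen through a source
# that consistently captures only half of the activity.
true_volume = [10, 20, 40]
measured_volume = [value * 0.5 for value in true_volume]

print(index_to_baseline(true_volume))
print(index_to_baseline(measured_volume))
```

Both calls print the same indexed series, so the constant undercount disappears from the trend. If the bias changes from period to period, this no longer holds, which is why a stable collection methodology matters.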

This series of graphs shows a clear progression. There seem to be echoes between Twitter’s overall traffic and the trends that Google Insights reveals. I’ll pull up traffic graphs from Compete.com & Quantcast.com. Unfortunately, I can’t go further back than a year without coughing up some dough, but the graphs are helpful nonetheless:

It’s interesting to look at everything together. The Google Insights maps change dramatically between January and April of 2007, but then look pretty similar for the rest of 2007. The Compete and Quantcast graphs look pretty flat over that same period.

The January 2008 graph shows a bit of a geographical advance just as the Compete and Quantcast graphs show an upward inflection after a period of slow growth.

Anyway, that’s all I have time for right now. Thanks to Andrew for getting the ball rolling and to Erick for refining the methodology. I hope this post adds to the body of knowledge on how to use Google Insights data.

Update: Ok, so I’ve been playing with Google Insights a little more. I’m running into some behavior that surprises me. It looks like there is a bug when you search multiple date ranges.
