msmemory_archive | Brilliant use of search data

The NY Times reports that Google is now charting the spread of the flu using aggregated search data mapped onto zip/state data. If there's a spike in searches on "flu symptoms" in New England, for example, the hypothesis is that flu incidence is rising there. The data are supporting the notion, too. That is SO COOL.
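To make the idea concrete, here is a minimal sketch of spotting a regional spike in query counts. It is not Google's method (the article does not give one); the function name `flag_spikes`, the threshold, and the sample numbers are all made up for illustration. It flags a region when its latest weekly count sits well above that region's own recent history.

```python
from statistics import mean, stdev

def flag_spikes(weekly_counts, threshold=2.0):
    """Return regions whose latest count exceeds mean + threshold * stdev of prior weeks.

    weekly_counts maps a region name to a list of weekly query counts,
    oldest first. This is a hypothetical data layout, not a real API.
    """
    flagged = []
    for region, counts in weekly_counts.items():
        *history, latest = counts
        if len(history) < 2:
            continue  # stdev needs at least two prior data points
        if latest > mean(history) + threshold * stdev(history):
            flagged.append(region)
    return sorted(flagged)

# Invented sample data: counts of "flu symptoms" searches per week.
counts = {
    "New England": [100, 110, 95, 105, 180],
    "Pacific": [200, 210, 190, 205, 208],
}
print(flag_spikes(counts))
```

With these invented numbers only New England is flagged, since 180 is far outside its earlier range while the Pacific figure is within normal variation.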

http://www.nytimes.com/2008/11/12/technology/internet/12flu.html


It is slightly invasive of privacy.

On the one hand, this is true. OTOH, it is no truer of this than it is of Google Trends in general. (Which I assume it's based on.) This particular horse is long since out of the barn, and this is just a minor instance of a more general matter.

And since people are now aware of it, it can be spoofed.

Again, precisely as true as any of the rest of Google Trends. There are a fair number of obvious ways to at least partially counter the effect, and if Google is using even the most basic ones, it is pretty unlikely that anyone is going to be able to skew the statistics without being horribly obvious about it. It's important to note that there are several ways to drive up the traffic on a particular subject, but I don't see any that are likely to succeed in doing so in geographically-controlled ways.
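As one example of what a "most basic" counter might look like (my assumption; nobody here knows what Google actually does), count distinct sources per region instead of raw queries, so a single source hammering the same search adds only one to the tally. The names `distinct_sources_per_region` and `events` are hypothetical.

```python
from collections import defaultdict

def distinct_sources_per_region(events):
    """Count unique sources per region, ignoring repeat queries from the same source.

    events is an iterable of (source_id, region) pairs; source_id could be
    an IP address or a cookie. Both fields are assumptions for this sketch.
    """
    seen = defaultdict(set)
    for source_id, region in events:
        seen[region].add(source_id)
    return {region: len(sources) for region, sources in seen.items()}

# Source "a" repeats the query three times but is counted once.
events = [("a", "NE"), ("a", "NE"), ("a", "NE"), ("b", "NE"), ("c", "West")]
print(distinct_sources_per_region(events))
```

A skewer then needs many distinct sources that all appear to be in the target region, which is the hard part.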

So seriously: I disbelieve. I'm sure it's hypothetically possible, but I don't see a practical way to manage the geographic balance of the spoofing. And that geographic balance is the point of the exercise. (As opposed to most Google Trends spoofing, which is all about simply increasing traffic on a subject...)

I'm sure it's hypothetically possible, but I don't see a practical way to manage the geographic balance of the spoofing.

And yet, while not trivial (it is some work), it is not particularly hard.

It depends on exactly how Google does geographic plotting of requests. I know how my employer does it, and how a few of our competitors do it, and all of those methods can be spoofed with little at-home effort.

I suppose I should give two examples.

The question is: how does Google perform geolocation of the user, when the user is using an IP address that is not allocated geographically? (Remember, if I have a static IP on my laptop, I can plug it in anywhere in the world....)

There are two methods that might be used, one of which is obvious but very inefficient. Using a PC or Unix box, hack the IP packets you send to Google so they carry a false source (reply) address, presuming that Google uses that address for traceroute or ping probes to determine your location. For spoofing purposes, you don't care if the reply gets lost. I doubt like heck that they do this, because it is an expensive way to geolocate an IP address.

The other is to use the same technique they use to load balance: when you contact Google.com, a DNS lookup is performed and an address is returned to you. Many large-scale server systems (such as Google's) load-balance by assigning unique or varied DNS replies based upon information they have about the DNS server you use, or other information related to that query. There are about four or five algorithms in use, some of which are patented and therefore easy to find.
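A rough sketch of that kind of DNS-based balancing, seen from the server side. This is not any of the patented algorithms, just the general shape: the answer depends on which resolver asked. The tables, region names, and `answer_for_resolver` are invented, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical table: which region each known resolver network belongs to.
RESOLVER_REGIONS = {
    ipaddress.ip_network("192.0.2.0/24"): "us-east",
    ipaddress.ip_network("198.51.100.0/24"): "eu-west",
}

# Hypothetical table: the front-end address handed out for each region.
REGION_SERVERS = {"us-east": "203.0.113.10", "eu-west": "203.0.113.20"}
DEFAULT_SERVER = "203.0.113.1"

def answer_for_resolver(resolver_ip):
    """Pick the server address to return, based only on the resolver's IP."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, region in RESOLVER_REGIONS.items():
        if addr in network:
            return REGION_SERVERS[region]
    return DEFAULT_SERVER  # unknown resolver: fall back to a generic answer

print(answer_for_resolver("198.51.100.53"))
```

The point to notice is that the lookup sees the resolver's address, not the end user's, so the region it infers is only as good as the assumption that people use a nearby DNS server.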

To create fraudulent requests, either consistently use a novel DNS server, or first collect a number of Google server addresses that correspond to a given region, and use those repeatedly.

Combine the two techniques, and the results are likely to be wildly successful. Use a server farm, or bot-net, or some distributed tool, and you can deeply amplify the result.

Is that hard? Not really, beyond startup costs for programming tools and some data collection.

(To go beyond that, you can do some truly exciting work with authorities and BGP....)
