msmemory_archive wrote 2008-11-12 09:54 am

Brilliant use of search data

The NY Times reports that Google is now charting the spread of the flu using aggregated search data superimposed on zip/state data. If there's a spike in searches on "flu symptoms" in New England, for example, then the hypothesis is that there is an uptick in the incidence of flu there. The data are supporting the notion, too. That is SO COOL.

http://www.nytimes.com/2008/11/12/technology/internet/12flu.html
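
To make the idea concrete, here is a rough sketch of how aggregated searches might be turned into a regional spike signal. It is not Google's actual method, which the article does not spell out. The log format, the function names `weekly_counts` and `spiking_regions`, and the "mean plus two standard deviations" rule are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def weekly_counts(search_log, term):
    """Count searches containing `term`, keyed by (region, week).

    `search_log` is an assumed format: an iterable of
    (region, week_number, query_text) tuples.
    """
    counts = Counter()
    for region, week, query in search_log:
        if term in query.lower():
            counts[(region, week)] += 1
    return counts


def spiking_regions(counts, current_week, threshold=2.0):
    """Return regions whose current-week count is well above their own history.

    A region is flagged when its count exceeds the mean of earlier weeks
    (weeks 0 .. current_week - 1) by more than `threshold` standard
    deviations. The spread is floored at 1.0 so a perfectly flat history
    does not flag every tiny change. This rule is an illustrative
    assumption, not a published algorithm.
    """
    regions = {region for region, _ in counts}
    flagged = []
    for region in sorted(regions):
        history = [counts[(region, week)] for week in range(current_week)]
        if len(history) < 2:
            continue  # not enough earlier weeks to form a baseline
        baseline, spread = mean(history), stdev(history)
        if counts[(region, current_week)] > baseline + threshold * max(spread, 1.0):
            flagged.append(region)
    return flagged
```

Each region is compared against its own history rather than against other regions, so a populous area with many searches every week does not look like a permanent outbreak.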

goldsquare.livejournal.com 2008-11-17 06:59 pm (UTC)
I suppose I should give two examples.

The question is: how does Google perform geolocation of the user, when the user is using an IP address that is not allocated geographically? (Remember, if I have a static IP on my laptop, I can plug it in anywhere in the world....)

There are two methods that might be used, one of which is obvious but very inefficient. Using a PC or Unix box, hack the IP packets you send to Google so that they contain a false reply (source) address, presuming that Google uses the reply address to perform traceroute pinging to determine your location. For spoofing purposes, you don't care if the answer gets lost. I doubt like heck that they do this, because it is an expensive way to geolocate an IP address.

The other is to use the same technique they use to load balance: when you contact Google.com, a DNS lookup is performed and an address is returned to you. Many large-scale server systems (such as Google's) load-balance by assigning unique or varied DNS replies based upon information they have about the DNS server you use, or other information related to that query. There are about 4-5 algorithms in use, some of which are patented and therefore easy to find.
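
For readers who haven't seen DNS-based load balancing, here is a toy sketch of the server side. It is not any of the real algorithms, just the general shape of "pick a reply based on which DNS server asked". The tables `RESOLVER_REGIONS` and `SERVER_POOLS` and the function names are hypothetical, and every address comes from the reserved documentation ranges.

```python
import ipaddress

# Hypothetical mapping from resolver networks to regions.
# All addresses are from the reserved documentation ranges.
RESOLVER_REGIONS = {
    ipaddress.ip_network("192.0.2.0/24"): "us-east",
    ipaddress.ip_network("198.51.100.0/24"): "europe",
}

# Hypothetical pools of front-end servers for each region.
SERVER_POOLS = {
    "us-east": ["203.0.113.10", "203.0.113.11"],
    "europe": ["203.0.113.20", "203.0.113.21"],
    "default": ["203.0.113.30"],
}


def region_for_resolver(resolver_ip):
    """Guess a region from the address of the DNS server that sent the query."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, region in RESOLVER_REGIONS.items():
        if addr in network:
            return region
    return "default"


def dns_reply(resolver_ip, query_count):
    """Pick the address to return: regional pool first, then round-robin within it."""
    pool = SERVER_POOLS[region_for_resolver(resolver_ip)]
    return pool[query_count % len(pool)]
```

Note that in this sketch the region is inferred from the resolver's address, not from the end user's, so the answer (and any location inferred from it) follows whichever DNS server the user happens to be using.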

To create fraudulent requests, either consistently use a novel DNS server, or first collect a number of Google server addresses that correspond to a given region, and use those repeatedly.

Combine the two techniques, and the results are likely to be wildly successful. Use a server farm, a botnet, or some other distributed tool, and you can greatly amplify the result.

Is that hard? Not really, beyond startup costs for programming tools and some data collection.

(To go beyond that, you can do some truly exciting work with authorities and BGP....)