Brilliant use of search data
Nov. 12th, 2008 09:54 am
NY Times reports that Google is now charting the spread of the flu using aggregated search data superimposed on zip/state data. If there's a spike in searches on "flu symptoms" for example, in New England, then the hypothesis is that there is an uptick in the incidence of flu. The data are supporting the notion, too. That is SO COOL.
http://www.nytimes.com/2008/11/12/technology/internet/12flu.html
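For the curious, the basic idea can be sketched in a few lines. This is only a toy illustration, not Google's actual method: the weekly counts, the region names, and the `spike_regions` helper are all made up, and the cutoff of two standard deviations above a region's own baseline is an arbitrary assumption.

```python
from statistics import mean, stdev


def spike_regions(weekly_counts, threshold=2.0):
    """Return regions whose latest weekly count is far above their own baseline.

    weekly_counts maps a region name to a list of weekly search counts,
    oldest first. The last entry is the week being tested.
    """
    flagged = []
    for region, counts in weekly_counts.items():
        baseline, latest = counts[:-1], counts[-1]
        if len(baseline) < 2:
            continue  # not enough history to estimate variation
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any increase as a spike.
            if latest > mu:
                flagged.append(region)
            continue
        if (latest - mu) / sigma > threshold:
            flagged.append(region)
    return sorted(flagged)


# Hypothetical weekly counts of "flu symptoms" searches (invented numbers).
counts = {
    "New England": [100, 110, 95, 105, 180],
    "Midwest": [200, 210, 190, 205, 208],
}
print(spike_regions(counts))  # ['New England']
```

Comparing each region against its own history, rather than against other regions, keeps a populous area from looking "sick" just because it searches more in general.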
no subject
Date: 2008-11-12 03:01 pm (UTC)
It is slightly invasive of privacy.
And since people are now aware of it, it can be spoofed.
no subject
Date: 2008-11-12 03:22 pm (UTC)
But spoofable? Well, yeah. There is that.
Still, Hari Seldon would be proud %^).
no subject
Date: 2008-11-12 03:43 pm (UTC)
After all, my company just introduced a product line that does advertising based upon behavioral targeting... And I've worked in this arena for a while.
Suffice it to say that someone I know is considering bankruptcy, and I thought quite a bit before I did web searches on the topic...
no subject
Date: 2008-11-12 03:53 pm (UTC)
no subject
Date: 2008-11-12 04:11 pm (UTC)
Do you ever use the same credit card for prescriptions and regular purchases?
Common identifiers are common. :-)
no subject
Date: 2008-11-14 12:00 am (UTC)
no subject
Date: 2008-11-12 04:12 pm (UTC)
So? Some small percentage of the population are griefers. How does the existence of this tool change that in any way?
You might as well say, "Great, now that we know how to make fire, some people will burn other people's huts down." Does that make the invention of fire a net loss?
no subject
Date: 2008-11-12 04:17 pm (UTC)
There are many measures that become much less valuable or sensitive once the subject knows they exist. I venture to say that this is one of them.
no subject
Date: 2008-11-12 04:23 pm (UTC)
Surely you realize how much of an outlier you are about your sensitivity to privacy issues. Likewise, spoofers are rare.
Yes, knowledge of this measure does, in some very small degree, make it less accurate. But knowledge of it *also* makes it much more accessible, greatly increasing the benefits. Seems like a clear win to me.
no subject
Date: 2008-11-12 04:34 pm (UTC)
Yes, I am an outlier on how important such things are to me. And yet: I blog. :-) I contain multitudes.
I think it would be trivial for someone who has access to some of the spam botnets to use them to drive false data, should they choose to. I can think of several ways to do so without botnets, but they are trickier.
My job, my professional expertise, involves understanding how such measures are vulnerable to skew, and how to stop or track that skew. It is what I do. I may be, in your eyes, ultra-paranoid. At the same time, such techniques of data mining represent rather dangerous intrusions into personal privacy. THIS USE may be innocent. But it is a model for others that might not be.
And if I were a sophisticated terrorist, knowing that I could spoof CDC and law enforcement in this way would be a powerful tool.
Frankly, if I wanted CDC and others to react to a prevalence of flu, I would not use indirect methods to get them to do so. If I were "The Man In The White Hat", I'd give them and local boards of health a phone call.
So, knowing this exists does not help the average person. Knowing it exists helps the bad guy. And seeing if this sort of profiling works can hurt the average person, in the long run.
no subject
Date: 2008-11-16 03:39 am (UTC)
On the one hand, this is true. OTOH, it is no truer of this than it is of Google Trends in general. (Which I assume it's based on.) This particular horse is long since out of the barn, and this is just a minor instance of a more general matter.
And since people are now aware of it, it can be spoofed.
Again, precisely as true as any of the rest of Google Trends. There are a fair number of obvious ways to at least partially counter the effect, and if Google is using even the most basic ones, it is pretty unlikely that anyone is going to be able to skew the statistics without being horribly obvious about it. It's important to note that there are several ways to drive up the traffic on a particular subject, but I don't see any that are likely to succeed in doing so in geographically-controlled ways.
So seriously: I disbelieve. I'm sure it's hypothetically possible, but I don't see a practical way to manage the geographic balance of the spoofing. And that geographic balance is the point of the exercise. (As opposed to most Google Trends spoofing, which is all about simply increasing traffic on a subject...)
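To make "the most basic ones" concrete, here is one guess at such a countermeasure: let each source count only once (or a few times) per region, so a single machine hammering a query cannot move the numbers. This is purely an assumption for illustration; nothing here is known to be what Google does, and `regional_counts`, the source labels, and the sample log are all invented.

```python
from collections import defaultdict


def regional_counts(queries, per_source_cap=1):
    """Count queries per region, capping how much any one source can add.

    queries is an iterable of (region, source) pairs, where source is
    whatever identifier the service has for the sender (hypothetical here).
    """
    seen = defaultdict(int)    # (region, source) -> queries counted so far
    totals = defaultdict(int)  # region -> capped total
    for region, source in queries:
        if seen[(region, source)] < per_source_cap:
            seen[(region, source)] += 1
            totals[region] += 1
    return dict(totals)


# One noisy source sends 500 queries; three ordinary sources send one each.
log = [("Boston", "a")] * 500 + [("Boston", "b"), ("Boston", "c"), ("Denver", "d")]
print(regional_counts(log))  # {'Boston': 3, 'Denver': 1}
```

A cap like this blunts one noisy machine, though by itself it would do nothing against a large number of distinct sources.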
no subject
Date: 2008-11-16 11:37 pm (UTC)
And yet, while not trivial (it is some work), it is not particularly hard.
It depends on exactly how it is that Google does geographic plotting of requests. I know how my employer does it, and how a few of our competitors do, and all of those methods can be spoofed with little at-home effort.
no subject
Date: 2008-11-17 06:59 pm (UTC)
The question is: how does Google perform geolocation of the user, when the user is using an IP address that is not allocated geographically? (Remember, if I have a static IP on my laptop, I can plug it in anywhere in the world....)
There are two methods that might be used - one of which is very inefficient but obvious. Using a PC or Unix box, hack your IP packets that go to Google so they contain a false reply address, presuming that Google uses the reply address to perform traceroute pinging to determine your location. For spoofing purposes, you don't care if your answer gets lost. I doubt like heck that they do this, because it is expensive as a way to perform geo-location of an IP address.
The other is to use the same technique they use to load balance - when you contact Google.com, a DNS lookup is performed and an address is returned to you. Many large-scale server systems (such as Google's) load-balance by assigning unique or varied DNS replies based upon information they have about the DNS server you use, or other information related to that query. There are about 4-5 algorithms in use, some of which are patented and therefore easy to find.
To create fraudulent requests, either consistently use a novel DNS server, or first collect a number of Google server addresses that correspond to a given region, and use those repeatedly.
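A toy model shows why the DNS-based approach would be weak as a location signal. Everything below is hypothetical: the `FRONTENDS` table, the region names, and the assumption that a service logs a request's region from the front-end address it arrived at are all invented for illustration (the addresses come from the reserved documentation ranges), and none of it describes how Google actually works.

```python
from dataclasses import dataclass

# Hypothetical table: the front-end address handed out depends on the
# region of the DNS resolver that asked, not on where the user really is.
FRONTENDS = {
    "northeast": "192.0.2.10",
    "midwest": "198.51.100.10",
    "west": "203.0.113.10",
}
REGION_OF_FRONTEND = {addr: region for region, addr in FRONTENDS.items()}


@dataclass
class Client:
    actual_region: str    # where the machine physically sits
    resolver_region: str  # where its chosen DNS resolver sits


def resolve(client):
    """Return the front-end address this client's resolver would be given."""
    return FRONTENDS[client.resolver_region]


def logged_region(frontend_address):
    """Region the service would record under the assumed scheme."""
    return REGION_OF_FRONTEND[frontend_address]


honest = Client(actual_region="west", resolver_region="west")
remote_resolver = Client(actual_region="west", resolver_region="northeast")

print(logged_region(resolve(honest)))           # west
print(logged_region(resolve(remote_resolver)))  # northeast
```

In this model the recorded region never depends on `actual_region` at all, which is the weakness being described: the signal follows the resolver or the address used, not the person.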
Combine the two techniques, and the results are likely to be wildly successful. Use a server farm, or bot-net, or some distributed tool, and you can deeply amplify the result.
Is that hard? Not really, beyond startup costs for programming tools and some data collection.
(To go beyond that, you can do some truly exciting work with authorities and BGP....)
no subject
Date: 2008-11-12 03:34 pm (UTC)
wow. group of geeks are we? YAY GEEKS!
no subject
Date: 2008-11-12 04:09 pm (UTC)
Even to the incidental detail that the head of Google.org is named "Dr. Brilliant" :-)