Corpus-based research, whether it aims at deep insight into language or at answering practical questions of language use, suffers from several difficulties. First, the corpora of authentic language use that have been assembled are usually not big enough. Second, they are usually biased in certain ways (for example, containing mostly academic or journalistic text rather than texts from other contexts and sources). Third, corpora are often clumsy to access.
My guess is that many people are starting to use Google as a huge corpus, mostly for practical information but occasionally for more academic pursuits. I personally use Google as a corpus several times a week, for example when I try to find out whether Chanukah or Hanukkah is the more common spelling (as it turns out, it’s the latter, by more than a three-to-one margin).
Though this post is more than a year old, it includes some very interesting commentary on the dangers of using Google as a corpus. Follow the links in the article for further discussion of related points. The bottom line is that there is a lot of junk on Google that can seriously distort results. I doubt that the distortion would seriously affect my own research on the spelling of the Jewish holiday, but more serious usage inquiries could be thrown a curve.
Have you seen the Association of Internet Researchers (air-l) list? Just this week they had a discussion about a very similar topic. The URL is http://www.aoir.org. There is a link to the archives here.
Hope this helps,
Laura