Thomas Koenig
Guest
|
| Posted: Thu May 12, 2005 6:33 am
Post subject: Re: Thrice as much |
|
|
Donna Richoux wrote:
| Quote: | The other reasons not to trust numbers obtained with "site:" are (1) not
all UK sites have "uk" in the address;
|
True, but very few US/AUS/other clutter sites are hosted under the .uk
domain.
| Quote: | (2) not all writing on "site:uk"
pages is written by UK speakers;
|
Also true, but the proportion of BE speakers on .uk sites should easily
be higher than the proportion on other domains. (As a native German speaker
and AE imposter, I'm trying my best on my .uk site to confuse these
stats, though.)
| Quote: | I've used the "site" trick as a fast way to get a rough estimate, but I
really wouldn't stake anything on it, not at this point.
|
I hear similar objections/cautions with respect to various Google stats on
all sorts of occasions. I have yet to meet a convincing criticism that
cannot be remedied with Statistics 101. |
|
Donna Richoux
Guest
|
| Posted: Thu May 12, 2005 3:33 pm
Post subject: Re: Thrice as much |
|
|
Thomas Koenig <fossa@gmx.li> wrote:
| Quote: | Donna Richoux wrote:
The other reasons not to trust numbers obtained with "site:" are (1) not
all UK sites have "uk" in the address;
True, but very few US/AUS/other clutter sites are hosted under the .uk
domain.
(2) not all writing on "site:uk"
pages is written by UK speakers;
Also true, but the proportion of BE speakers on .uk sites should easily
be higher than the proportion on other domains. (As a native German speaker
and AE imposter, I'm trying my best on my .uk site to confuse these
stats, though.)
I've used the "site" trick as a fast way to get a rough estimate, but I
really wouldn't stake anything on it, not at this point.
I hear similar objections/cautions with respect to various Google stats on
all sorts of occasions. I have yet to meet a convincing criticism that
cannot be remedied with Statistics 101.
|
Well, I'm pleased to hear it, and may call on you to solve some of the
problems *I've* encountered. Unfortunately, I don't keep a record of
them, so I can't quickly haul out description and evidence.
Off the top of my head, I remember these numerical problems:
1) The minus problem -- adding a term preceded by a minus sign (to
exclude it) can actually increase the estimated number of results. (Note
that I mean the estimation, not the actual number.) Since this is not
logically possible, it can only mean that one or both of the two
estimations (with the minus term, without the minus term) were
unreliable.
2) The geographic variation problem -- someone in, say, California
running a search can get totally different estimation numbers for the
same search as someone in, say, Europe. Not just mildly different, but
WAY different, like ten or fifty or a hundred times as much.
The related erratic problem -- sometimes the person in Europe, after
a few days, would start getting numbers similar to the California ones.
3) The "cat dog" problem Mark Brader reported a couple of years ago,
where certain combinations of words yielded zero results even though
other searches showed that pages containing them existed.
4) The estimation/reported hits variation -- too often, the Google
estimation figure in the top corner reports a high number, but when you
examine the list of actual hits it can find, it is piddling few. Even
accounting for the suppression of duplicates. To me that says the
estimate was wrong, thrown off by some unknown circumstance.
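The minus problem in particular boils down to a plain consistency rule: excluding a term can only remove pages, so the count with the minus term should never exceed the count without it. A minimal sketch of that check, where the function name and the sample figures are made up for illustration and are not real Google output:

```python
def minus_is_consistent(count_without_minus: int, count_with_minus: int) -> bool:
    """Return True if the pair of counts is logically possible.

    Excluding a term can only shrink (or leave unchanged) the set of
    matching pages, so the second count must not exceed the first.
    """
    return count_with_minus <= count_without_minus


# Hypothetical estimates, invented for this example.
print(minus_is_consistent(12_000, 9_500))   # a plausible pair
print(minus_is_consistent(12_000, 48_000))  # at least one estimate is off
```

A failed check does not say which of the two estimates is wrong, only that they cannot both be right.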
People here know I like Google for various tasks, and I think the
estimation numbers do show something when used cautiously. But Google
uses some sort of formula to generate those estimates, and it has flaws.
I know the above items are sketchy and if you are serious about wanting
to investigate them, I can supply more details.
--
Best wishes -- Donna Richoux |
|
Mark Brader
Guest
|
| Posted: Fri May 13, 2005 3:40 am
Post subject: Re: Thrice as much |
|
|
Donna Richoux writes:
| Quote: | Off the top of my head, I remember these numerical problems:
1) The minus problem ...
2) The geographic variation problem ...
3) The "cat dog" problem ...
4) The estimation/reported hits variation -- too often, the Google
estimation figure in the top corner reports a high number, but when you
examine the list of actual hits it can find, it is piddling few. Even
accounting for the suppression of duplicates. To me that says the
estimate was wrong, thrown off by some unknown circumstance.
|
In fact, that problem is the reason why I didn't post any Google counts
in relation to this thread. On some of my phrase searches the estimated
hit counts seemed suspiciously high (in the range 50,000 to 250,000, as
I recall), so I asked for 100 hits per page and started stepping through
pages. Sure enough, Google ran out of hits well before I reached the
1,000-hit limit, and when I asked it to include suppressed duplicates,
it still ran out early.
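The stepping-through procedure can be sketched roughly as follows. Here `fetch_page` is a hypothetical stand-in for whatever returns one page of results starting at a given offset; it is not a real Google API, and the 100-per-page and 1,000-hit figures are simply the ones mentioned above.

```python
from typing import Callable, List


def count_retrievable_hits(fetch_page: Callable[[int], List[str]],
                           page_size: int = 100,
                           hit_limit: int = 1000) -> int:
    """Step through result pages and count the hits actually returned.

    Stops when a page comes back short (the results have run out)
    or when the hit limit has been reached.
    """
    total = 0
    start = 0
    while start < hit_limit:
        page = fetch_page(start)
        total += len(page)
        if len(page) < page_size:
            break  # a short page means there is nothing further to fetch
        start += page_size
    return total


# Simulated search with 237 real hits behind a much larger estimate.
simulated_hits = [f"hit {i}" for i in range(237)]
estimated = 50_000  # hypothetical figure from the top corner of the page
actual = count_retrievable_hits(lambda start: simulated_hits[start:start + 100])
print(actual, "retrievable hits against an estimate of", estimated)
```

If the retrievable count is far below both the estimate and the 1,000-hit limit, the estimate is the suspect figure.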
There is also (5) counts for search terms with different numbers of
words sometimes seem out of whack, especially when it's a single word
versus a short, common phrase including it. I haven't seen this lately
and don't remember a specific example, but I presume it's related to
the way the information from different words in the phrase is combined.
It might very well be related to the minus problem.
--
Mark Brader | "I don't have to stay here to be insulted."
Toronto | "I realize that. You're insulted everywhere, I imagine."
msb@vex.net | -- Theodore Sturgeon
My text in this article is in the public domain. |
|