# Google numbers (again)



## mhp

This is the second time I have made this request.

I ask that posts that merely quote Google numbers over 1000 be removed as "off topic". There are two reasons for this request.

First: Google has not published how it arrives at these numbers. In fact the algorithm is a "trade secret." These numbers are misleading and often change by several orders of magnitude from hour to hour and from server to server.

Second: Google never lists hits over 1000. This means that no conclusion can be drawn by looking at 3000 hits versus 3,000,000 hits.

----
Edit: There is a third and more important reason that is so obvious that I failed to mention: If we take these two premises to be true, such posts detract from and confuse the issue.


----------



## fenixpollo

So, what you're saying is that the forum community should actively discourage the use of Google search results as a source (as data to support an argument). Is that correct?


----------



## mhp

Hi fenixpollo,

Yes, that's what I am asking, but with some additional considerations.

For example, if there are only 100 hits in Google (Web, Books, Scholar, etc) I can see that this can be useful to point out in some circumstances. My problem is when these types of arguments are extended to hits over 1000: "Look, this is by far more common than that because Google gives 3000 hits for this and 3,000,000 hits for that."

This may seem like a compelling argument, but since no hits over 1000 can be listed, this statement is in fact as good as saying: "I have no idea, but I am going to type some numbers for your enjoyment."


----------



## Outsider

At the very least, Google hits can be used to show that _someone_ uses a certain word or phrase.

Of course that does not prove that the sentence or phrase is correct, or that it is more correct, or more frequent, than another word or phrase with fewer hits. People should be reminded that Google hits will include text written by non-natives, and by natives with substandard literacy.

As long as these caveats are understood by all, I don't see a need to reject argument-by-Googling outright.


----------



## timpeac

Google results need to be taken with a large pinch of salt - a very large pinch of salt - but I don't think they're worthless over 1,000. I agree that it is annoying Google won't reveal their algorithm so we can judge any assumptions for ourselves, but I would take 3,000 versus 3,000,000 to mean that the second was more common (although not necessarily 1,000 times more common, as the numbers might suggest).

I think the key is more to make sure everyone understands Google's (big) limitations rather than banning reference to it full-stop.


----------



## mhp

Outsider said:


> At the very least, Google hits can be used to show that _someone_ uses a certain word or phrase.
> 
> Of course that does not prove that the sentence or phrase is correct, or that it is more correct, or more frequent, than another word or phrase with fewer hits. People should be reminded that Google hits will include text written by non-natives, and by natives with substandard literacy.
> 
> As long as these caveats are understood by all, I don't see a need to reject argument-by-Googling outright.


Outsider, unfortunately, there is a sharp learning curve involved.

I am somewhat of an expert in this field (modesty aside) and I realize that for a layman who has just begun to experiment with the wonders of search engines, or even for the more daring ones who dabble with regular expressions, Google may seem like a godsend. It is not.

It is essentially impossible to search through a database the size of Google’s in a fraction of a second and come up with semi-accurate estimates of proximity of words, let alone resolution of a regular expression. It is to Google’s credit that they can come up with relevant hits based on their rank system.

I really don’t want to get into details. The fact is that _large_ numbers reported by Google are not based on actual matches in the database.

I really cannot hope to educate every person who participates in these forums about such intricacies. The best I can hope is to make moderators aware that numbers over 1000 are essentially meaningless and strongly advise that posts that make arguments based on these numbers be deleted as off-topic.

For example see this thread that was just posted.


----------



## timpeac

mhp - perhaps you could put a few pointers together on the subject that we could turn into a sticky post that we could direct people to each time the subject comes up? It could be the basis of a "what Google can and can't do" post.


----------



## fenixpollo

I see what you are saying (I think): if there are less than 1000 Google results, a forero could use that data to support their argument that a particular usage is not common; but if there are more than 1000 Google results, a forero could not use that data to support their argument that a particular usage is common, because of the way that Google arrives at those large numbers.

While I agree that it's impossible to educate all members about Santa Google's limitations, it's possible to educate key senior members (and anyone else who will listen). If the moderators and active senior members are dispelling the myth of Google's accuracy, and not accepting (large) Google numbers as valid data, then we will see a change in the forum. After all, the culture of this place is made by its members.

So I don't think an out-and-out prohibition of Google numbers is necessary. However, I do think that those who are educated on the subject should educate others and dispel misconceptions whenever they are posted.

mhp, I liken your battle against Google to the battle against chatspeak that is fought in this forum on a daily basis. It seems like a losing battle (or at least neverending), but that's not a good reason to stop fighting.


----------



## cuchuflete

Google projections of "hits" are estimates.  They are not facts or actual counts.  That said, there is utility in using search engine projections as indicators of direction and magnitude.  While many people make the error of taking search engine "results" as fact, one can overstate the limitations of their utility:  "...make moderators aware that numbers over 1000 are essentially meaningless".  No.  They are not meaningless.  They are not actual counts, but that doesn't make them meaningless either.

A careful reader or writer in these forums is going to consult a corpus, if one is available.  Other people will depend on their own personal experience with a term.
The latter is no more or less a matter of facts than what a search engine can provide in its estimates.  That doesn't make it meaningless either.  

What is just plain wrong and misleading is to use search engine reports as if they were ironclad statements of any statistical accuracy:  _Blupblup is five times more common than blipblap, according to Gugle._  Search engine counts may be useful, but both readers and writers need to be aware of the difference between (1) actual citations captured and available for viewing, and (2) projections.

Consider some of the other things one finds in a language forum:  A non-native learner of Xxxxian asks if a word is common.  Five native speaker foreros respond, three saying that they have never seen it.  Two more assert that the word is well known to them.  Unless and until we know something of the reading and speaking contexts of these five respondents, perhaps together with their locations and ages, we have nothing factual, yet the statistically invalid statements are not meaningless.  They tell the questioner that the term is probably known to some, but not all native speakers. Should we outlaw all comments that are not accompanied by copious citations?  That would not be a useful approach.


----------



## mhp

Well, I was going to suggest that I may be willing to write a short sticky on this. But aside from my own reluctance, I see there is no consensus. 

“They are not actual counts, but that doesn't make them meaningless either.”

This is where the sharp learning curve comes in. No, they are not entirely meaningless in the context where they are derived.

But then it begs the question of what that context is. That context turns out to be a balance between computation cost (i.e. speed) versus actual matches. This becomes even more complicated by how many such queries were cached, i.e. pre-computed based on demand.


----------



## timpeac

Although I don't necessarily disbelieve what you say, you're being almost as mysterious as the Google algorithms about the reasons you mistrust them.

My layman's stance is that Google results should be taken with a lot of salt, but that a result such as 3,000 compared to 3,000,000 will sway me that the second is much more common (how much, where and why I wouldn't infer from that fact).

If it's not reasonable to make that assumption then in much the same way we would all like more information on Google algorithms I would need more information on your reasoning. Without it I see no reason to ban Google results as a representation of nothing more, or indeed less, than what they are - an algorithm for interpreting a count within a huge corpus of data.



mhp said:


> Well, I was going to suggest that I may be willing to write a short sticky on this. But aside from my own reluctance, I see there is no consensus.
> 
> “They are not actual counts, but that doesn't make them meaningless either.”
> 
> This is where the sharp learning curve comes in. No, they are not entirely meaningless in the context where they are derived.
> 
> But then it begs the question of what that context is. That context turns out to be a balance between computation cost (i.e. speed) versus actual matches. This becomes even more complicated by how many such queries were cached, i.e. pre-computed based on demand.


----------



## cuchuflete

I just queried three search engines for two different terms.  I clicked through to the end of the actual displayed page citations, finding 1000 for each of the terms.  The projected "results" for one of the terms was between 3 and 4 times greater than for the other with all three search engines.  Unless one assumes that the search engine algorithms are absolute junk, nothing more than wild guesses with no basis at all in collected data, then the results are far from meaningless. 


I think it is both reasonable and useful to draw the conclusion that one of the terms is more frequently used than the other. If, on the other hand, one were to say with certainty and conviction, "Term B is more than three times more common than Term A", I would challenge that assertion.


_Edit:  Using the BYU corpora for both American and for British English, one of the terms was found with much greater frequency than the other in both English variants, in accord with what the search engine results suggested._


----------



## timpeac

mhp said:


> I really don’t understand this. I am calling attention to something that may not be obvious to the uninitiated. If my previous reference about “_The most important airlines of the world_” is not good enough, perhaps 117,000 versus 20 is (numbers quoted at the time of writing). I am not trying to prove anything. What I am saying is with the best intention to improve the quality of these forums. If not welcomed, I’ll happily go my way.


Of course it's welcome. You just haven't proven your point sufficiently yet. You keep talking about "the uninitiated" while suggesting that because you are not in that group you have superior knowledge that should be taken at face value while being coy about the details "the initiated" are privy to.

For your first link there, yes - it's a bad search. It will pick up a whole host of irrelevant hits because of the "*".  For the second, well now both give me 12 hits. However, in order to proscribe a whole resource I for one would like to know more about the reasoning behind that, rather than a few cherry-picked examples.

No one disagrees that google results are very far from a perfect indicator. Over and above the efficacy of any algorithm the terms have to be entered correctly, and the results carefully reviewed (for example a "google fight" between the terms "happy" and "content" will be swayed in "content"'s favour because it has more than one meaning, and is also the French for "happy" etc). However, that doesn't make it completely useless in drawing very broad conclusions about properly constructed and reviewed queries - unless you have further information on this you'd like to impart.


----------



## giovannino

cuchuflete said:


> I think it is both reasonable and useful to draw the conclusion that one of the terms is more frequently used than the other. If, on the other hand, one were to say with certainty and conviction, "Term B is more than three times more common than Term A", I would challenge that assertion.


 
This seems quite reasonable to me.

mhp, nobody is saying that you are not right in pointing out the limitations of Google searches.
However, with languages like Italian, for which no online corpora are available, Google searches are the only source, however flawed, of information on relative frequency. I did the same as Cuchuflete and searched for a few words or phrases which have alternative, less common variants (actually labelled as _meno comune_ in dictionaries). In all cases the more common variants got many more hits.

As for the "airlines" example, if the forer@ had not replaced "airlines" with "*", the results for "the most important airlines of the world" would have been only eight: one from the WR thread and seven from non-native sources.


----------



## JamesM

Can you at least point us to information that we could read on the subject? I agree that the numbers are not accurate representations of total count and that there is no way to verify them but I also don't see how they are completely irrelevant and off-topic. 

I'm sure you can imagine that anyone making a suggestion that we delete all posts about any topic or related to any reference (Serbo-Croatia as a country, for example, or the linguistic relationship of Urdu and Hebrew, as two random artificial examples) would not receive immediate compliance and a wholehearted embrace of such a policy change without substantial information besides their judgment, however well-informed, that such references were off-topic.

I understand your suggestion and your stated reasons for it, but I'm sure that you can also see that it would be imprudent for any board to adopt a policy of deleting posts based on one user's judgment, especially if the user feels he can't go into any detail on the subject.

You are welcome to make your suggestion. We are equally free to discuss it and question it. Discussing it, or even challenging it, does not indicate a lack of welcome.


----------



## mhp

Yes, I can point you to string matching algorithms. I can even give references to positional search and heuristic web search. But I'm not sure you would find these useful.


----------



## mhp

fenixpollo said:


> I see what you are saying (I think): if there are less than 1000 Google results, a forero could use that data to support their argument that a particular usage is not common; but if there are more than 1000 Google results, a forero could not use that data to support their argument that a particular usage is common, because of the way that Google arrives at those large numbers.
> 
> While I agree that it's impossible to educate all members about Santa Google's limitations, it's possible to educate key senior members (and anyone else who will listen). If the moderators and active senior members are dispelling the myth of Google's accuracy, and not accepting (large) Google numbers as valid data, then we will see a change in the forum. After all, the culture of this place is made by its members.
> 
> So I don't think an out-and-out prohibition of Google numbers is necessary. However, I do think that those who are educated on the subject should educate others and dispel misconceptions whenever they are posted.
> 
> mhp, I liken your battle against Google to the battle against chatspeak that is fought in this forum on a daily basis. It seems like a losing battle (or at least neverending), but that's not a good reason to stop fighting.



Hi fenixpollo,

I just saw your post now! Thank you.

Perhaps those of us in Spanish forums suffer more because of these numbers. This is only a guess, since I rarely participate in other forums.

I noticed that posts that referenced Google numbers were immediately deleted in a recent thread I opened in the English only forum (thanks cuchuflete) but that such posts take on a life of their own in the Spanish grammar forum.


----------



## ampurdan

The following is just my personal opinion as a member of the forums:

I only look at Google result counts as first guidance for myself; it's been years since I last posted Google numbers in a thread, I think. I’m a layman, but I don't like quoted Google numbers, for many reasons:

1.- They first give you a number that can go up to millions, but then, when you go to page two, those millions end up being just a few, perhaps just 5 or 6.  

2.- Results are very easy to misinterpret. Sometimes I don't really know how to make a relevant search. For instance, how accurate is it to search for "se le considera" site:ar to find out whether that phrase is used in Argentina? Obviously Argentine .com sites won't appear, but if the numbers are large, shall I assume that it is used?

3.- You must be careful about what you search, even if you don't use *, "se le considera" might give me results I should not count.

4.- The links mhp has posted gave me a different number of results, not mhp's, not Timpeac's. I wonder if Google gives different results depending on the part of the world you're in. 

5.- All this is complicated, and sometimes many posts in a thread are devoted to elucidating how relevant a Google result is or how the search should be performed, which in my personal opinion detracts value from it, especially when some foreros confine themselves to spitting out Google numbers or base all their reasoning on them.

6.- I might take Google results with a grain of salt, but since I don’t have a clue about how that mysterious algorithm works, how can I possibly know whether a grain of salt is enough or I would need some truckloads, rather? It seems very different from searching through a language corpus with actual matches to my IT illiterate self, even if it were just because language forums were conceived to be used for language searches. 

While all this does not preclude that Google might give some valuable first indication in some cases, in my opinion, all that has been said makes Google numbers unreliable and distracting enough to be worth considering somehow restricting their use in the forums*. How and to what extent, I don’t know exactly.

EDIT - Or giving some indications as to how to use it and how not...


----------



## mhp

timpeac said:


> Of course it's welcome. You just haven't proven your point sufficiently yet. You keep talking about "the uninitiated" while suggesting that because you are not in that group you have superior knowledge that should be taken at face value while being coy about the details "the initiated" are privy to.


I wanted to sleep on this for a couple of days before replying.

It was perhaps wrong to use the terms layman, uninitiated, or even novice, which I was thinking but thankfully didn't use. Those terms may be offensive.

I honestly don't know how to go about explaining all of this in a way that is understandable to a person with a highly analytical mind who lacks formal or practical training in this field.

Let me pose just one simple question: Is the fact that any Google number over 1000 is unverifiable in dispute?


----------



## ampurdan

I only have doubts and questions about this issue.

I think my main concern is "how unreliable those estimates are". I think most people might think, ok, those are not actual hits, just an approximation, but if Google says so, the real number must not be very different.

Is there a way to know how unreliable a search is? I guess the answer is "no", for we don't know how Google gets those results, right? Does it make it totally and absolutely unreliable for hits above 1000?

Is it the same with other search engines?

Sometimes I want to know whether a sentence I've made up is actually used in English. I put it between inverted commas and at first I get big figures, but then when I go to the second page, I see that perhaps there were only 17 hits or fewer...


----------



## cuchuflete

mhp said:


> Let me pose just one simple question: Is the fact that any Google number over 1000 is unverifiable in dispute?



I believe it was the Sage of Baltimore, Mr. Henry Louis Mencken, who said something along these lines:  _For every difficult, vexatious problem, there is one simple, easy answer.  And it's wrong._

1. Can we use Google to verify that a projected count of web instances of a query term is greater than 1000?  No. Or at least not without a lot of complex jumping through hoops to find some number of instances greater than 1000.  Certainly Google does not provide any easy, clear way to verify that a term may be found on more than 1000 web pages.

2. Can we use other sources to verify that relative frequencies of two or more terms, as suggested by Google "results", are generally correct?  Yes.  No training in statistics is required to make basic use of available tools.

3. It is useful to let forum members know that Google "hits" are not trustworthy as verifiable counts, but declaring Google results totally off-limits is just plain silly.  Common sense use of Google searches can offer useful information, and that information can be verified through other sources.


If mhp's campaign were focused on teaching people what search engine results can do reliably, and what they cannot do, I would be in full support of that effort.  Banning an information source entirely because many people don't know much about it is a poor solution to the problem.  Consider a parallel situation:  Many non-native students of a language accept the first native reply to their thread questions.  Sometimes these native replies are wrong.  Should we ban all replies by native speakers?


----------



## ampurdan

cuchuflete said:


> If mhp's campaign were focused on teaching people what search engine results can do reliably, and what they cannot do, I would be in full support of that effort.



I'd fully support that too. I'd actually sit in one of the first desks.


----------



## JamesM

So would I.   It could be a very useful addition to the forums in general, a sort of primer for using Google.

We often remind each other and new users of the pitfalls of using Google.  It is not seen as the ultimate answer by any moderator that I know.  In fact, it can be a real pain when someone decides that something must be correct English because there are over 3,000 hits for it on Google.  

That said, I don't think an outright ban is the correct action to take.


----------



## cuchuflete

We trust (or many of us, in our less cynical moments... trust) dictionaries to be reasonably accurate.  They each tend to be developed on the basis of a corpus.

We take it on faith that the words in the dictionary are properly distinguished as current, old-fashioned, slang, offensive, obsolete, archaic, etc. based on the use lexicographers make of their respective corpora.  Yet most of us have never seen those collections of terms in use.  We don't know anything about their absolute or relative frequencies.

Should we ban dictionaries, or advise people how to use them in reasonable ways?

I invite mhp and any other forum members with good knowledge of search engine databases and forms of presentation to offer their thoughts to the moderator team.
We can certainly post guides to the effective use of search engines in a language forum context.


----------



## timpeac

mhp said:


> Let me pose just one simple question: Is the fact that any Google number over 1000 is unverifiable in dispute?


No, not by me anyway. However, my basic point is illustrated by Cuchuflete's comment about dictionaries. We don't know how they are compiled, and people have ones they like and ones that they don't but we would nonetheless expect the vast majority to at least give the correct spelling of the words they contain. 

It seems similar to me with Google. We may not be able to review hits over 1,000 and do not know how the numbers are estimated, but it doesn't necessarily follow that those numbers are utter nonsense. They might be - but I'd need some sort of reasoning before completely abandoning assumptions based on Google results.

Let me ask you a question back: If phrase a got a Google hit of 3,000 and phrase b got a Google hit of 3,000,000 would you at least say that phrase b is more common than phrase a?

If not, that's very interesting - I certainly don't want to be making incorrect assumptions - but I would still need more explanation than just your word for that before abandoning looking at Google hit numbers completely.


----------



## mhp

ampurdan said:


> I think my main concern is "how unreliable those estimates are".



I think this is the crux of the question. The answer is: sometimes they are reliable and sometimes they are not!

The examples I have cited are not cherry picked. They are what I see on most recent posts. If you get 700,000 hits on the first page and 25 on the third page, it is easy to verify the audacity of the claim.

Eventually, Google’s algorithm may decide that the phrase in question may merit further investigation, a decision reached by the number of queries issued for that particular phrase against the *cache state*, and give more realistic numbers.

Let’s *suppose* I am the foremost expert on what Google can or cannot do. How can I give an intelligent answer about the reliability of a hit count over 1000?

Sometimes the limitations of the heuristic used by Google appear on page number 3 (converting 700,000 to 25), sometimes on page 57. If there are not enough queries, Google’s algorithm may decide the cut-off point is on page 125, which is unverifiable.

If I don’t know about the *cache state* (an unknowable quantity), I cannot give an informed answer. My best informed advice: If you encounter a hit count over 1000, try to go to page 100. If Google still says over 1000 after page 100, wait and see what happens. Never _assume_ to be true what you can’t verify.

Therefore, supposing I am an expert, my advice cannot be anything but: hits over 1000 are an unknown quantity. They may change drastically based on cache state.


But that is common sense.
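
For anyone who wants the advice above as a checklist, here is a rough sketch in Python. It is only an illustration of the rule of thumb, not a description of how Google works: the function name `assess_estimate`, the cap of 1000 listed results and the page size of 10 are assumptions taken from the premises of this thread.

```python
# Sketch of the "click through before you trust it" rule of thumb.
# LISTING_CAP and RESULTS_PER_PAGE are assumptions based on this thread,
# not documented Google behaviour.
LISTING_CAP = 1000      # premise: no more than about 1000 results are ever listed
RESULTS_PER_PAGE = 10   # assumed page size, so page 100 is where the cap is reached


def assess_estimate(estimated: int, last_result_listed: int) -> str:
    """Classify a headline estimate after clicking through to the last page."""
    if estimated <= LISTING_CAP:
        return "checkable: small enough to read every listed hit"
    if last_result_listed < LISTING_CAP:
        # The listing ran out early, e.g. 700,000 shrinking to 25 on page 3.
        return f"contradicted: only {last_result_listed} results were actually listed"
    # Still "over 1000" at page 100: nothing further can be inspected.
    return "unverifiable: treat as an unknown quantity"


print(assess_estimate(700_000, 25))
print(assess_estimate(3_000_000, LISTING_CAP))
print(assess_estimate(100, 100))
```

Note that the "contradicted" label only says that the listing stopped early. It does not account for results omitted as duplicates, so it is a prompt to look at the citations rather than a verdict.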


----------



## Cagey

Consider the following:

Currently I get:
Results *1* - *10* of about *101,000* for *"my sister Sue"*.  (*0.15* seconds)
Clicking through to the end:  Results *791* - *792* of about *101,000* for *"my sister sue"*.  (*0.81* seconds)
(Two days ago it was 99,100, and clicked through to 777.)

Results *1* - *10* of about *134,000* for *"my sister Mike"*.  (*0.16* seconds)
Clicking through to the end: Results *121* - *121* of *121* for *"my sister Mike"*.  (*0.12* seconds)
(Two days ago it was 136,000 and clicked through to 122.)
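
To make the gap concrete, the arithmetic on these four figures can be sketched in a few lines of Python. The numbers are the ones quoted above; the helper name `overestimate_factor` is just an illustrative choice, and the clicked-through figure is itself limited to roughly 1000 listed results, so it is a floor rather than a true count.

```python
# Figures copied from the searches above:
# phrase -> (headline estimate, last result reachable by clicking through)
searches = {
    "my sister Sue": (101_000, 792),
    "my sister Mike": (134_000, 121),
}


def overestimate_factor(estimated: int, reachable: int) -> float:
    """How many times larger the headline estimate is than the listed results."""
    return estimated / reachable


for phrase, (estimated, reachable) in searches.items():
    factor = overestimate_factor(estimated, reachable)
    print(f"{phrase!r}: estimate {estimated:,}, reachable {reachable:,}, about {factor:,.0f}x")

# Compare "Mike" against "Sue" both ways.
est_ratio = searches["my sister Mike"][0] / searches["my sister Sue"][0]
reach_ratio = searches["my sister Mike"][1] / searches["my sister Sue"][1]
print(f"Mike/Sue by headline estimate: {est_ratio:.2f}")
print(f"Mike/Sue by listed results:    {reach_ratio:.2f}")
```

The headline estimates put "Mike" ahead of "Sue", while the listed results put it far behind, so the two measures point in opposite directions.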

Yes, I have selected this particular example because it is inaccurate. However, that is because I don't have at hand any of the real searches I have made that revealed the same problem.

There are things to say about the results for "my sister Mike" such as, they are generally references to a book of that name, or  the result of coincidental juxtapositions.  However:


- This would not be a discussion about language.
- Many people find such discussions tedious and don't read them.
- I don't know what use is made of the raw scores in other language forums, but in the EO forum, a better argument to the same effect can be made by looking at the actual cited instances and not the raw numbers. You can look at results on the first few pages, and comment on the contexts in which a given word or phrase is used, etc. In any case, it is important to check the results to see what is being counted. This is necessary to see whether the numbers are being inflated by the lyrics to a popular song, for instance. If all this has been checked, it makes sense to offer the number of actual results in reputable contexts as an indication that a word or phrase is common.

Note: It is my understanding that "Google searches" is used here as a generic term for "open-ended searches of the internet", and is not intended to distinguish Google from Bing, for instance.

At least that is what I intend to be saying.  The predictive numbers in searches that are in limited fields, such as "Google books" or "Google scholar" tend not to be so widely different from the number of citations finally given. I believe I have seen Google book estimates that are off by a factor of 2 or 3, but they are not exponentially off, as in the case of general searches of the web.  And they quickly go down once you begin clicking through.


----------



## JamesM

If I include the items that were omitted because of duplication I get a full 1000 for "my sister Sue".   Google never shows more than 1,000 specific citations as far as I know.


----------



## Cagey

JamesM said:


> If I include the items that were omitted because of duplication I get a full 1000 for "my sister Sue".   Google never shows more than 1,000 specific citations as far as I know.


Yes, but it is not the lower number of actual hits for "Sue" that concerned me.   I assume that anything over 600/700 hits = "lots" in Googlese. With more complex phrases, I could be impressed by a lower number of actual hits.

I posted the search results for the sake of the raw numbers, which might suggest to someone who was not familiar with English naming conventions that more sisters are named Mike than are named Sue.

As I understand it, these raw numbers and their usefulness are the specific subject of contention in this thread.  In any case, it is the issue that I am most interested in.


----------



## JamesM

Ah, I see.  So it is the conclusion drawn from the counts that is the problem.  Yes, I agree.  It can be very misleading.

I also agree that using COCA or the British Corpus or Google Books is a much more productive way to discuss the use of something.  However, some casual language won't show up in those searches, so it's not perfect.  Nothing is.


----------



## timpeac

This "my sister xxx" example is a bit of a side-track, I think, because, as far as I know, Google ignores full-stops in queries. So even if we knew that Google was the most accurate number-counter going, I wouldn't trust a comparison between these two phrases, because each will also match text where the name comes after a full-stop. The counts will therefore also reflect that "mike" is more common than "sue", that it is used in other languages, whether the verb "sue" is more common than the noun "mike", etc. etc.

This is the sort of example that does show how important it is to have phrases which can be compared by a search in the first place - and also the problems such search comparisons can cause in a thread by derailing it to discussions about how to use a google search rather than discussing the original question. I agree that both these are draw-backs of allowing comments on google results.

Whether there is a way to formulate a query that would count the inclusion of a full-stop is exactly the sort of thing I'd love to learn, and could help form a really useful sticky on this topic. (As well as how not to make it count, because we'd want to compare "my sister Sue" and "my sister, Sue" since such a comma is optional).

mhp - you didn't answer my question.


----------



## Cagey

*I.* My point was that the raw projected numbers can be very misleading.  

In fact, comparing the numbers of the _actual citations_ gives a much more realistic picture. When you look at the citations themselves, it is easy to see that there are very few examples of "my sister Mike" that would serve as evidence that people have sisters named Mike, while there are plenty of citations in which people are clearly referring to sisters named Sue. 

That is why I oppose the use of raw numbers as evidence about usage, but endorse the judicious use of actual citations.  As James said above, there are usages that are not covered by the corpora, and the search engines are very useful for these. 

*II.* The fact that Google ignores punctuation is a problem in many searches. These particular searches are not unusual in this respect.   

Some problems may be avoidable if people have a good understanding of how the search engines work, but most people don't know much about how to set up good searches.  A good many people don't even look over their results before quoting the numbers, so they haven't even checked to see whether their search has selected the usages or constructions they had in mind. 

Every time doubtful Google numbers are quoted, the choice is between pointing it out, which is itself off-topic and runs a good risk of derailing the discussion, or letting it stand.  

At the minimum, I think that it would be useful to have a sticky that advocated thoughtful use of actual citations and explained the objections or drawbacks to accepting the projected numbers as evidence of usage.  In this way, anyone who felt that search engine results were being used in a misleading fashion could direct the poster to that informative sticky, instead of repeating this non-linguistic discussion in the thread itself. 

As I have said before, in addition to other problems, many people find discussions of the statistical issues and other problems tedious and won't read them in the thread anyway.  If the information is in a sticky, at least those who are interested can find a coherent explanation there.


----------



## timpeac

Hi Cagey - I don't disagree with anything that you say there. I don't think anyone disputes that there are huge potential problems with the apparent results Google gives both because of incorrect phrasing of the query, and because anyone who disputes the incorrect query necessarily derails the thread.

Those are concerns that it is definitely worth addressing. However, I'm trying to get at whether the best phrased query possible could give incorrect comparative results (on a very broad basis) because I think this needs to be clarified first. If it is so, the rest of the question really becomes irrelevant because we simply couldn't base any assumptions no matter how broad on a Google result so it would save us time in trying to decide how best to attack the problems of poor phrasing or interpretation of the query.


----------



## mhp

timpeac said:


> mhp - you didn't answer my question.


Hello again,

  I was away from my computer and I just read your question, so this is my first chance to answer. I have already referenced one published work on the inner workings of Google. As far as I know, Google has never published anything on how they arrive at these numbers.

  My original post is a suggestion. Now you are asking me to prove my suggestion. I’ll try. 

  Please remember that numbers reported by Google change dynamically. At the time of this writing, I get the following results:

  Results *1* - *10* of about *168,000* for *"more cowardly"*. (*0.36* seconds) 
    Results *891* - *893* of about *43,200* for *"more cowardly"*. (*1.20* seconds) 

  Results *1* - *10* of about *41,300* for *"more coward"*. (*0.12* seconds) 
    Results *511* - *518* of about *41,200* for *"more coward"*. (*0.57* seconds) 

  Results *1* - *10* of about *24,600,000* for *"more of a coward"*. (*0.34* seconds) 
    Results *461* - *464* of *464* for *"more of a coward"*. (*0.36* seconds) 

  The first line is the number of hits reported on the first page. The second line is the number of hits reported on the last page. What can you conclude from these numbers?


For your assumption (that 3000<3,000,000) to be true, the heuristic used by Google to reach these numbers must have the property that if a<b then f(a)<f(b), where f(x) is the estimate of x. This is a very strong claim in computer science and as far as I know the estimator used by Google does not have this property.
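To make that condition concrete, here is a minimal Python sketch that checks it against the figures quoted above. It is only an illustration: the function name `order_violations` is hypothetical, and it treats the number of results listed on the last page as a stand-in for the true count, which it is not (listing stops at 1000 and leaves out similar pages).

```python
from itertools import combinations

# (phrase, results listed on the last page, estimate shown on the first page)
# These are the figures quoted in this post; they change from search to search.
observations = [
    ("more cowardly", 893, 168_000),
    ("more coward", 518, 41_300),
    ("more of a coward", 464, 24_600_000),
]

def order_violations(rows):
    """Return pairs where a < b in listed count but f(a) >= f(b) in estimate."""
    violations = []
    for x, y in combinations(rows, 2):
        # Order the pair so that `low` has the smaller listed count.
        low, high = sorted((x, y), key=lambda r: r[1])
        if low[1] < high[1] and low[2] >= high[2]:
            violations.append((low[0], high[0]))
    return violations

for low, high in order_violations(observations):
    print(f'"{low}" has fewer listed results than "{high}" but a larger estimate')
```

Any pair it prints is a case where the estimate and the listed count disagree about which phrase has more results. That alone does not show which of the two figures is wrong.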



If you don't mind, I'll just sit back and see how this develops because I don't wish to prove anything.


----------



## JamesM

It would help if I could repeat your results, but I can't.

Here is the last page of clicking through on "more cowardly":

http://www.google.com/#hl=en&rlz=1R2SUNA_enUS371&q=%22more+cowardly%22&start=990&sa=N&fp=467c3568f2eec009

Perhaps it's more accurate on a repeated inquiry if it recently traversed the same citations for another query.

As I write this (and I tested the embedded link here as well) it shows "about 167,000" on the last page, which is the same number it said on page one.  It also goes the full 1,000 citations.


----------



## ampurdan

> The answer is: sometimes they are reliable and sometimes they are not!


Sorry, mhp, that's not an answer. You're just repeating the assumption I made when asking the question, that is, that sometimes it is quite accurate and sometimes it isn't.

What margin of error do Google estimations have when compared to real matches in the WWW? What would be an acceptable margin of error? How does this margin of error change when the figures increase or decrease?

A person that tells something similar to the truth half of the time and something far from the truth the other half is totally unreliable and not trustworthy for me.

Sorry, Cuchu, I don't see the analogy with native speakers or a dictionary, not anymore. Native speakers in Wordreference check one another. Dictionaries are reviewed and are subject to criticism. I guess neither Google nor any other common search engine (Yahoo, Ask.com, Bing, etc.) has been conceived as a language tool, and no one checks that its estimations are really useful for language purposes, am I wrong? Google and other search engines are judged according to their ability to produce relevant results for the searcher and put the most relevant in the first places, while eliminating or downgrading hits they deem uninteresting for the searcher, aren't they?

While I used to blindly trust Google and search engines as quite reliable tools for language search (once you learn to make a somewhat relevant search), now I've learned that Google results are just estimations, and that there is much uncertainty around the question of _*how often*_ results well above 1,000 are actually well above 1,000 and _*how often*_ larger estimations for "a" really imply that "a" appears more times than "b" on the Internet. 

When I click on your link, James, I'm on page 89 and I read: "Results *881* - *885* of about *26,100* for *"more cowardly"*.  (*1.09* seconds)".

Then I repeated mhp's searches:

Results *1* - *10* of about *37,100* for *"more coward"*.  (*0.20* seconds)
Results *501* - *509* of about *37,100* for *"more coward"*.  (*0.42* seconds)

 Results *1* - *10* of about *24,700,000* for *"more of a coward"*.  (*0.25* seconds) 
 Results *461* - *464* of *464* for *"more of a coward"*.  (*0.37* seconds)

When I repeat the search including omissions, I get:
 Results *571* - *578* of about *24,700,000* for *"more of a coward"*.  (*0.31* seconds) 

Results for "more of a coward" did not descend to 464; that is, it kept saying "24,700,000" until I was on page 47, the last one, with omissions.

When I repeat the search setting results per page to 100, I get:
 Results *1* - *100* of about *834,000* for *"more of a coward"*.  (*0.17* seconds) 
 Results *401* - *461* of *461* for *"more of a coward"*.  (*0.42* seconds) 
Including omissions:
 Results *501* - *571* of about *834,000* for *"more of a coward"*.  (*0.32* seconds)
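As a rough illustration of how far these figures drift for one and the same query, the numbers quoted above can be put side by side. This is a minimal Python sketch, not a measurement: the helper name `spread` is made up for the example, and the values are only the ones from the searches above.

```python
# Estimates reported for the same query, "more of a coward", in the searches above.
estimates = {
    "10 per page": 24_700_000,
    "100 per page": 834_000,
}

# Results actually listed on the last page of each click-through.
listed = {
    "10 per page": 464,
    "10 per page, with omissions": 578,
    "100 per page": 461,
    "100 per page, with omissions": 571,
}

def spread(values):
    """Ratio between the largest and the smallest value."""
    values = list(values)
    return max(values) / min(values)

print(f"estimates differ by a factor of about {spread(estimates.values()):.0f}")
print(f"listed results differ by a factor of about {spread(listed.values()):.2f}")
```

The listed results stay within the same few hundred whichever way the search is run, while the estimates for the identical phrase are more than an order of magnitude apart.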

My conclusion is that I agree with Cagey. Actual citations from Google or other search engines make much more sense.

Raw Google numbers for two phrases might not be comparable to one another, might be very wrong and might be wrongly interpreted. Raw number quotes in a thread are likely to create confusion or distractions. I don't think it is the moderators' job to check every instance of search-engine number-dropping for proper use, but I think they can delete topic derailments that stem from a use of search engine numbers that has been shown to be unwise or bad.

Banning all search engine quotes might be excessive, but taking some measure to discourage wild use of them seems appropriate to me.

If 3000 for "a" and 3,000,000,000 for "b" might not necessarily mean that "a" is less used than "b" on the Internet, it would be great to warn users about this fact too. In my opinion, it should not be discussed in each and every thread in which such a comparison between two or more possible usages is made.


----------



## cuchuflete

Any query of any database will be misleading if not informed by a little common sense.

After my initial surprise at the seemingly high results for "more coward" I looked at the first page, and adjusted the query to inject a little bit of reason:

Results 351 - 390 of 390 for "more coward" -noel. (0.30 seconds) 
 Results 1 - 50 of about 10,200 for "more coward" -noel. (0.09 seconds) 

Had I spent more than 2 seconds looking at the first page, I might have made further adjustments.  

This is not a defense of Google's predictive accuracy; I cannot defend what I can neither see nor understand.  It is a suggestion that when a user sees something that appears contrary to intuition or common sense, they look at it.  

As I wrote earlier in this thread, Google "hits" are not counts.  The only counts are those on the last page of a "click through", and the maximum value at this time is 1000.


----------



## ampurdan

Thank you, Cuchu. I actually thought those were the same thing. 



> This is not a defense of Google's predictive accuracy; I cannot defend what I can neither see nor understand. It is a suggestion that when a user sees something that appears contrary to intuition or common sense, they look at it.



I agree.  

Unfortunately, non-native speakers often do not have that kind of language intuition; that's precisely what leads them (us) to use search engines.

A wise use of a search engine seems to require many abilities.


----------



## chileno

I'm sorry for butting in, but it seems impossible to me not to, when Google is being thought of as a tool of any precision based on the number of hits, with results that fluctuate depending on the day of the week or whatever else.

I use Google just to see if people are using certain expressions or not, then I verify with online dictionaries such as RAE, Oxford, and Merriam-Webster.

Case in point:

A person from Perú asked about how to translate the word "agropecuario". I searched Google for "agropecuarian" and, though I do not remember how many, many results were found. All of them were from Latin American countries, on sites that offered their texts in several languages. However, the correct translation for "agropecuario" is "agricultural", as attested by the Merriam-Webster dictionary site.

Google has a purpose and it is a good one, to search and get ideas on what else might be available, and as such is invaluable, at least for me.

Thank you.

Hernán.


----------



## Cagey

chileno said:


> I'm sorry for butting in, .....


There is no need to apologize.  This is an open conversation and you are not "butting in". 

Thank you for your very clear explanation of a way to make good use of Google  results, or the results from any search engine.


----------



## chileno

Cagey said:


> There is no need to apologize.  This is an open conversation and you are not "butting in".
> 
> Thank you for your very clear explanation of a way to make good use of Google  results, or the results from any search engine.



You are more than welcome.

I confess openly that I do not command my own language, and much less pretend to command the English language, but I do try my best. In doing so, I learn.


Hernán.


----------



## timpeac

chileno said:


> I'm sorry for butting in, but it seems impossible to me not to, when Google is being thought of as a tool of any precision


I don't think anyone claims that Google is a tool of any precision - it's at best just an indication.


----------



## chileno

timpeac said:


> I don't think anyone claims that Google is a tool of any precision - it's at best just an indication.



Right, but it is an indication that people try to imbue with precision. As I interpret what mhp is trying to convey, searches that are under the 100 mark, or whatever mark, should be accepted as ... correct/precise?

That's my interpretation and I might well be totally off. I ask for your understanding.


----------



## Cagey

I think mhp's concern is primarily with the opposite problem, with the use of the result numbers that go over 1000, that is, the numbers that are purely estimates, not counts. In some instances people use these numbers to make arguments about frequency, but the numbers are not reliable.  I share this concern. 

I hope that people will learn to make appropriate use of the Google results, which is why I appreciate your explaining how you do it.


----------



## chileno

Cagey said:


> I think mhp's concern is primarily with the opposite problem, with the use of the result numbers that go over 1000, that is, the numbers that are purely estimates, not counts. In some instances people use these numbers to make arguments about frequency, but the numbers are not reliable.  I share this concern.
> 
> I hope that people will learn to make appropriate use of the Google results, which is why I appreciate your explaining how you do it.



Thank you for your kind words, but that is precisely what I was asking: would fewer than 1000 hits, say 100, make them any more reliable as evidence of the usage of the word(s) at stake? I don't think so.


----------



## Cagey

We are talking here about discussions of the frequency of one usage compared to another, which is where people use the numbers.  Sometimes they are trying to establish that one usage is an error, by showing that it is very uncommon compared to another, sometimes it is simply the question of which is most common that interests them for some reason.  

If you have lower numbers, you can look at the actual hits and see what  they are. Your point is valid that there is no guarantee that the proportions are accurate, but if you get several hundred for one construction and twenty or thirty for another, it seems likely that they reflect the actual proportions over the texts searched.    Of course, the contexts in which a usage is found and other things  always need  to be considered.  

If you work on bounded searches, like Google news, or books, or scholar, I do think the numbers are more likely to reflect reality in general, at least for the kind of writing being sampled.

If you are trying to decide what usage to adopt, there may be other reasons than frequency for picking one over another. I don't think that anyone in this thread is arguing that the usage with the highest number of hits is always "right".


----------



## chileno

Cagey said:


> We are talking here about discussions of the frequency of one usage compared to another, which is where people use the numbers.  Sometimes they are trying to establish that one usage is an error, by showing that it is very uncommon compared to another, sometimes it is simply the question of which is most common that interests them for some reason.
> 
> If you have lower numbers, you can look at the actual hits and see what  they are. Your point is valid that there is no guarantee that the proportions are accurate, but if you get several hundred for one construction and twenty or thirty for another, it seems likely that they reflect the actual proportions over the texts searched.    Of course, the contexts in which a usage is found and other things  always need  to be considered.
> 
> If you work on bounded searches, like Google news, or books, or scholar, I do think the numbers are more likely to reflect reality in general, at least for the kind of writing being sampled.
> 
> If you are trying to decide what usage to adopt, there may be other reasons than frequency for picking one over another. I don't think that anyone in this thread is arguing that the usage with the highest number of hits is always "right".




Then I misunderstood. My apologies.


----------

