# Banning number of hits reported by Google



## Sköll

I have seen quite a few threads where people give the number of hits reported by Google as an indication of the relative frequency of usage. Unfortunately, these numbers are invariably an inaccurate representation of the actual number of cases in the database.

  For example, at the time of this writing, Google reports:
   Results *1* - *10* of about *55,200,000* for *"hasta el punto de que tenga"*. (*0.39* seconds)
    Actual number of cases in the database: 34 

   Results *61* - *70* of about *243* for *"hasta el punto de que tiene"*. (*0.13* seconds) 
    Actual number of cases in the database: 165

  Sometimes the result is off by a little, other times by a huge margin.

  There are other databases where the number of cases in the database is reported accurately, and those results can be useful. However, given the unreliability of Google in producing such *estimates*, I propose that the rules of WR be amended to discourage listing the number of hits reported by Google as any indication (much less a "proof") of usage.


----------



## Sowka

Hello Sköll 

I think that in a discussion about a word and its usage many bits of information are collected, and the number of google hits is just ONE of these bits of information that has to be considered appropriately. 

I would never rely on google information alone. I normally use google to find the sites that actually contain the information, written by the right people.

So even if the number indicated by google were correct: What does it tell me? Do I really want to speak and write like the majority of people, just because it's the majority? 

Everybody has to make this evaluation for himself or herself.

Good evening to you


----------



## Sköll

I agree with you, Sowka, that Google is an indispensable tool in locating resources. However, the number of hits reported by Google is based on heuristics that produce remarkably inaccurate results. If the number reported were, in fact, a somewhat accurate estimate, I would consider that an extra bit of information. The fact that some phrase is used often is indeed very useful information. All I am saying is that Google often gives incorrect information about frequency of usage. It is difficult to point this out every time people use this "information" in supporting their point of view.


----------



## Sowka

I see. But in my opinion, it would be doubtful if WRF banned the numbers. There are so many unreliable sources out there.

I feel that the only way is this: everyone has to learn to understand the quality of sources and their respective limits. You can't deal with such a question on an administrative level. People have to gain experience and make the errors that they need for their understanding. In my opinion, this is all part of the process.

If you feel you have to point out the "google-number-alone-is-no-indicator" problem often, you could save an explanatory text in a text document and copy and paste it whenever needed (or make a link to your initial post here). 

OK, that's just my point of view.


----------



## Sköll

I understand your point about this being a learning process. However, look at this thread. Here I dared to point out that the number of hits is so inaccurate as to render any comparison meaningless. The result is a complete disruption of the thread. It is interesting that even when I point out that instead of 700,000 there are actually 279 cases, I’m still not understood: Someone mentions "uneducated use" across the internet, another person simply dismisses what I had said and proposes an additional phrase that should also be considered in the "count"; mind you, by the same method that produced an estimate of 700,000 for 279 cases.


----------



## GavinW

Skoll,
I think you're making a very important point, and I'm aware it's not an easy point to make. I think your point hinges on the fact Google does, in fact, provide estimates, not a count. And I would hazard the opinion that you're possibly overestimating other people's ability to detect/appreciate the difference between the two things (I was totally unaware Google only offered "estimates" until you pointed it out, but have always had an instinctive antipathy to the citing of Google hits in WR threads, which happens fairly frequently).
I sympathise with your difficulties in trying to restore rigour to a linguistic discussion when such misleading statistics are cited as "information" rather than the product of what you correctly describe as flawed "heuristics".
I think there's a strong case for imposing a ban on references to Google "hits", despite the apparent extremeness of such a move.


----------



## NewdestinyX

Skoll,
I couldn't disagree more with your conclusion that Google hits are useless for language study and hits = usage. Though I share your concern that people don't know HOW to get solid and useful results -- it's only from a lack of knowledge many times (not all the time) in using Google as a tool and its various filters and Boolean tools.

Let me make my case by picking up on your last comment in the other thread where this came up.



Sköll said:


> But let me try one last time: The numbers you are quoting are meaningless. They are output of a program that tries to come up with an estimate in a fraction of a second without actually scanning the entire database. Please do the following:
> 
> 1. Search for "*cómo es de bueno*". (it should report about 13,000 hits)
> 2. At the bottom of the page, go to page 50. (you shouldn't be able to)
> 
> You get an actual of about 330 hits. The other 12,000 hits are just not there. They were never there!


I'm sorry to disagree with the implication you're making. Though going to page 50 doesn't work as you say, initially -- There are many more than 300 'actual', unique entries, Sköll. Simply repeat the search 'without omitted entries' as it says on the last page, 33. I'm getting to page 57 and more -- and the reported 'seconds' for finding the search is longer.

It is a very useful language learning tool that every student should avail themselves of. I have 10 years of proof of its help and usefulness.

Grant


----------



## NewdestinyX

Skoll,
I've done even more testing today.. and I have to modify my position a little.

I still don't agree that Google is a useless tool in establishing commonness of usage in comparison in our linguistic study.. but something '*does*' seem to be '*broken*' as of recent. This 'estimated' amount being way higher than the actual is a *new* problem. This has not been the case for the last 10 years. *All* results have been available in my searches these last 10 years. Possibly Google got so large that they are limiting the reports to us to save bandwidth. As you follow a given search - even with your preferences set to 100 per page -- Google does stop giving you results after about 10% of the number they report on the first search.

*But*--- again -- the results are accurate for establishing commonness of usage -- the ratios are the same as the opening number they report.

Instead of 167,000 versus 14,780 -- the actual is 1200 versus 480. Still it's clear to conclude that the Peninsular Spanish idiom is used less than the Latin American one -- as you would expect.

Thanks for bringing this limitation to our attention. I will be writing Google for an explanation of their thinking. I was actually part of the early 'beta testing' of Google and was very excited about the tool. I repeat -- them not allowing access to 'all' in their database is a new development for sure.

Grant


----------



## Kevin Beach

Whether or not Google reports the number of hits properly is only one problem.

The other is that the internet is almost wholly unedited, unappraised, uncensored and uncorrected. Internetters feed off each other's ignorance and misinformation. Many sites simply pull information from others and present it as if it were the absolute truth, but it is often rubbish.

The fact that Google might return XX,000 examples of a particular usage of a word or phrase does not mean that XX,000 people have used it in that way. Nor does it mean that it can be compared with YY,000 examples of another usage with statistical reliability.

The only way to be guided by Google entries is to access as many of the listed pages as is reasonable in the time available and then ask "Could this be an accurate use of what I am looking at?"

Google can be a very useful tool, but for these purposes it is a sledgehammer, not a scalpel.


----------



## Mate

Kevin Beach said:


> Whether or not Google reports the number of hits properly is only one problem.
> 
> The other is that the internet is almost wholly unedited, unappraised, uncensored and uncorrected. Internetters feed off each other's ignorance and misinformation. Many sites simply pull information from others and present it as if it were the absolute truth, but it is often rubbish.
> 
> The fact that Google might return XX,000 examples of a particular usage of a word or phrase does not mean that XX,000 people have used it in that way. Nor does it mean that it can be compared with YY,000 examples of another usage with statistical reliability.
> 
> The only way to be guided by Google entries is to access as many of the listed pages as is reasonable in the time available and then ask "Could this be an accurate use of what I am looking at?"
> 
> Google can be a very useful tool, but for these purposes it is a sledgehammer, not a scalpel.


My thoughts, exactly. Nicely put, Kevin Beach.


----------



## cuchuflete

Mateamargo said:


> My thoughts, exactly. Nicely put, Kevin Beach.



Ditto, Mateamargo and Kevin Beach.  We should extend this beyond the named search engine to its competitors as well.  A larger database, built by spiders that index more sites, may exacerbate the flaws in estimates at times.  It may be more accurate at others.  Users have a responsibility to know what they are quoting, and forum readers need to be aware of both the benefits and limitations of tools such as search engines.

A _corpus_, such as those available for BE and AE, is generally a more precise tool.


Would I support the banning of search engine estimates?  Certainly not.  That could easily lead us down the path of banning quotations from dictionaries.  Some dictionaries are of excellent quality, while others are not.  Should we proscribe citations from those of inferior quality?  I would prefer to see all available contributions to a thread discussion, including rebuttals when search engine results are presented as fact rather than as general indicators.


----------



## NewdestinyX

Kevin Beach said:


> Whether or not Google reports the number of hits properly is only one problem.
> 
> The other is that the internet is almost wholly unedited, unappraised, uncensored and uncorrected. Internetters feed off each other's ignorance and misinformation. Many sites simply pull information from others and present it as if it were the absolute truth, but it is often rubbish.
> 
> The fact that Google might return XX,000 examples of a particular usage of a word or phrase does not mean that XX,000 people have used it in that way. Nor does it mean that it can be compared with YY,000 examples of another usage with statistical reliability.
> 
> The only way to be guided by Google entries is to access as many of the listed pages as is reasonable in the time available and then ask "Could this be an accurate use of what I am looking at?"
> 
> Google can be a very useful tool, but for these purposes it is a sledgehammer, not a scalpel.


Yes - agreed - that it's more a sledgehammer than a scalpel -- but in terms of determining a general rule of thumb for 'commonness' of usage of a phrase -- the ratios are still very accurate - as long as you know how to set the Boolean keys correctly to filter out unwanted results. Most people don't and therefore it is sometimes not useful at all. Search engines, when used correctly, are a valuable tool and I hope they're never banned from usage here. That would be a mistake.

Respectfully submitted,
Grant


----------



## Kevin Beach

NewdestinyX said:


> Yes - agreed - that it's more a sledgehammer than a scalpel -- but in terms of determining a general rule of thumb for 'commonness' of usage of a phrase -- the ratios are still very accurate - as long as you know how to set the Boolean keys correctly to filter out unwanted results. Most people don't and therefore it is sometimes not useful at all. Search engines, when used correctly, are a valuable tool and I hope they're never banned from usage here. That would be a mistake.
> 
> Respectfully submitted,
> Grant


Grant, this is a genuine question and not rhetorical - how do you know that Google results are very accurate if all you have to compare them with are more Google results?


----------



## GavinW

cuchuflete said:


> Would I support the banning of search engine estimates? Certainly not. That could easily lead us down the path of banning quotations from dictionaries.


 
While I find the points you make fairly persuasive, and have in no way decided it would be a good idea to press for banning citations of Google "statistics", I can't agree that there's a necessary link, or a cline, from search engine estimates to dictionary quotations. I believe we are looking at two different categories of "information." One (dictionaries) is, at least(*), clearly intended for the use to which it is put (reference and consultation), the other is not (as has been well argued by others).

(*However, as a onetime lexicographer on bilinguals and monolinguals, I feel sorry for those who haven't learnt to use dictionaries critically!)  

That said, I find this debate useful and informed, and am grateful to all contributors to it thus far. It's looking less and less cut and dried to me.


----------



## Paulfromitaly

I believe that one point has been made very clearly: the main issue is not how reliable Google hits can be, but rather whether people may or may not be able to read and interpret them. 
The same problem lies with statistics in general: if people cannot interpret them, statistics are virtually useless if not misleading.


----------



## NewdestinyX

Kevin Beach said:


> Grant, this is a genuine question and not rhetorical - how do you know that Google results are very accurate if all you have to compare them with are more Google results?


Fair question, Kevin. But easy to answer. Go to the links themselves - for each hit. I mean go to 'each' of 100 links or so and read a bit. Over 10 years of doing so I am confident that the hits referenced appear in legit web articles, media or forums.. That is the 'real' number - as Sköll has pointed out, unbeknown to me, that you only get to see about 10% of the number Google reports. That's new. But the ratios from search to search are still accurate even at the low sampling.

Chao,
Grant


Paulfromitaly said:


> I believe that one point has been made very clearly: the main issue is not how reliable Google hits can be, but rather whether people may or may not be able to read and interpret them.
> The same problem lies with statistics in general: if people cannot interpret them, statistics are virtually useless if not misleading.


Well said!! And I agree wholeheartedly.

Thanks,
Grant


----------



## Sköll

It is true that last night Google was reporting 700,000 cases of 'cómo de bueno es'---I saw it. A few hours later that number was changed to a different number. Today it says 1,500. 

Last night Grant was comparing 700,000 to 167,000 and drawing a certain conclusion. Today he is comparing 14,700 to 167,000 (he still has not updated his post to reflect the latest number reported by Google, which is 1500) and is drawing a different conclusion. It will not be surprising at all, if at some time in future 165,000 is also changed to a completely different number. What sort of conclusion can be drawn from such gross errors in statistical data? 

  Do we really want to turn all threads where these guesstimates are quoted into a discussion about the validity of these numbers?


----------



## NewdestinyX

Sköll said:


> It is true that last night Google was reporting 700,000 cases of 'cómo de bueno es'---I saw it. A few hours later that number was changed to a different number. Today it says 1,500.
> 
> Last night Grant was comparing 700,000 to 167,000 and drawing a certain conclusion. Today he is comparing 14,700 to 167,000 (he still has not updated his post to reflect the latest number reported by Google, which is 1500) and is drawing a different conclusion. It will not be surprising at all, if at some time in future 165,000 is also changed to a completely different number. What sort of conclusion can be drawn from such gross errors in statistical data?
> 
> Do we really want to turn all threads where these guesstimates are quoted into a discussion about the validity of these numbers?


Actually read the thread again, Sköll. I updated those statistics in the other thread earlier today. You should edit your posts where you quoted me too. And I will in future posts about it - until Google fixes this 'problem' -- use the 'real' numbers you refer to to make my claims about usage. That's a fair request of you.

But please stop referring to Google as a 'haphazard' untrustworthy search engine. You're going too far with that analysis. The results are not haphazard and there's likely a problem they're experiencing of recent. Today I put in the same request in and the number of hits kept jumping from 10,000 to 15,000 and back lower - so something's up. A number going 'up' consistently each day would be accounted for from simply new web articles appearing on the net. But going 'down' is weird.

And thanks for acknowledging that you too saw the 700,000 for 'cómo es de bueno'. That's really weird.

There is definitely a problem with Google in these last couple days. Sadly - because they're the biggest search engine out there they won't admit to any problems. But -- still - when no technical problem exists we can reference it with confidence. Ah -- now when to know there is 'no problem' -- that is the problem... right?  Fair enough.

Grant


----------



## Sköll

NewdestinyX said:


> ...You should edit your posts where you quoted me too. ...But please stop referring to Google as a 'haphazard' untrustworthy search engine. You're going too far with that analysis.


 
Do you mean that you want me to go back and edit what you said to reflect the latest thing you are saying? I'm afraid that won’t be possible as the thread is locked. But a very interesting request nonetheless.

I have no comments about the way you have quoted me.


----------



## NewdestinyX

Sköll said:


> Do you mean that you want me to go back and edit what you said to reflect the latest thing you are saying? I'm afraid that won’t be possible as the thread is locked. But a very interesting request nonetheless.
> 
> I have no comments about the way you have quoted me.


We can agree to disagree on the usefulness of Google. You just made a comment about something I hadn't done yet -- that I did do with regard to setting the record straight. I just wanted your info to be current. That's all.

Thanks for the discourse. I think this has been a good discussion. I will continue showing Google results as proof of commonness of usage as I'm convinced it's vital in establishing and supporting points for input to the forum. I will however look for 'real numbers' in my comparisons.

Thanks all,
Grant


----------



## Loob

I'm going to be brief (unusually for me)

I think we have to be _*exceedingly*_ careful with the google totals given on the first page of results: these are often very different from the totals you see on page 5 or page 10, which represent "number of unique examples".

In this, I agree 100% with Sköll.

That said, I do think google-hits can be a useful indicator of relative frequency.  I would not want to see them banned.


----------



## Sköll

Loob said:


> I think we have to be _*exceedingly*_ careful with the google totals given on the first page of results...


   I think it is safe to say that we all have used Google totals at some point. It is very tempting to throw numbers around, especially when those numbers seem reasonable in our opinion or, even better yet, when they support what seems reasonable to us. So I understand the reluctance to do away with what seems to be useful.

  Can we be *exceedingly* careful and give better, more educated numbers from heuristics used by Google to arrive at better, more reliable estimates?

  Grant is saying that by some coincidence the inconsistencies we see in these numbers started about the same time I started this thread. That can be true---at least, it is not entirely impossible. 

  Along the lines of these inconsistencies, the number of hits for "qué tan bueno es" was consistently reported as 167,000 for the last couple days or so. Even when you got to the last page (about page 75) it still was saying 167,000. In the last few hours that number has changed to about 58,000. It still says 58,000 on the last page. This number is reported by both google.com and google.co.uk. However, if you go across the Channel to google.fr, or a little further south to google.es, this number changes to about 10,000.

  I am not making this up! It seems as more queries are made, these estimates become more accurate for a particular geographic server. I have a strong feeling that at some time in future google.com will catch up to google.es and report 10,000 for this particular phrase if we keep issuing queries.

  After a few days of looking at these numbers, is 1500:10000 a better estimate than 700000:167000? Considering that the two phrases under consideration are, in fact, the difference in usage between Spain and most of Latin  America, we can say that could be true in this case. But this conclusion is reached by knowing how these phrases are used in Spain and the rest of the world, not by keeping track of how these numbers are changing.
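To make the arithmetic concrete, here is a minimal sketch that computes the two ratios from the figures quoted in this thread. The variable and function names are only illustrative; the numbers are the reported totals mentioned above, not anything measured independently.

```python
# Reported totals quoted in this thread at two different times.
# The dictionary and function names are illustrative only.
earlier = {"cómo de bueno es": 700_000, "qué tan bueno es": 167_000}
later = {"cómo de bueno es": 1_500, "qué tan bueno es": 10_000}


def ratio(counts, a, b):
    """Return how many times more often phrase a is reported than phrase b."""
    return counts[a] / counts[b]


r_earlier = ratio(earlier, "cómo de bueno es", "qué tan bueno es")
r_later = ratio(later, "cómo de bueno es", "qué tan bueno es")

print(f"earlier ratio: {r_earlier:.2f}")  # 4.19
print(f"later ratio:   {r_later:.2f}")    # 0.15

# The two snapshots disagree about which phrase is the more common one.
print("same direction:", (r_earlier > 1) == (r_later > 1))  # False
```

The first pair of numbers makes the first phrase look roughly four times as common as the second; the later pair makes it look roughly seven times rarer. The comparison flips direction, not just magnitude.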

Are we really here to look at numbers such as 700000:167000 or 1500:10000? Are we here to evaluate how Google is arriving at these numbers?

Now, I get off my soap box. I’m out of breath.


----------



## JamesM

Just as a side question (and in the interest of using the final page number rather than the original estimate), do you know a way to get to the last page of a Google search quickly? The only way I know is to click on the highest page number displayed over and over and over... until I reach the end.

I can't imagine doing that with 167,000 hits, so I'm hoping that you know a faster way to the last page.

Also as a side note, Google can be helpful in showing infrequency as well.  When someone claims that it's a common phrase and Google finds only three hits, one of which is the WordReference thread, you can be fairly certain that it's not a common phrase.  

I agree that it's important to be careful, but I wouldn't want to see them banned.


----------



## Sköll

As far as I know, and this is by no means anything other than personal experience, Google cuts the search results to the first 100 pages or so. The fast way that I know to get there is by jumping 9 pages at a time. The only indication that there are more results than listed on the last page is on the top line where it says something like “Results 911 - 914 of about 426,000,000 for hello”.  But that result can also be wrong as we have seen in the case of more complex search queries that involve proximity—i.e. quoted literals.

I also agree that when the result is at the other extreme (only a few hits), that is a very good indication that such usage is not common. Arguably, that information can be given by any native speaker rather than performing a comparative study. But I have seen instances where, shall we say, a speaker is confused.


----------



## NewdestinyX

Sköll said:


> Can we be *exceedingly* careful and give better, more educated numbers from heuristics used by Google to arrive at better, more reliable estimates?
> 
> Grant is saying that by some coincidence the inconsistencies we see in these numbers started about the same time I started this thread. That can be true---at least, it is not entirely impossible.
> 
> Along the lines of these inconsistencies, the number of hits for "qué tan bueno es" was consistently reported as 167,000 for the last couple days or so. Even when you got to the last page (about page 75) it still was saying 167,000. In the last few hours that number has changed to about 58,000. It still says 58,000 on the last page. This number is reported by both google.com and google.co.uk. However, if you go across the Channel to google.fr, or a little further south to google.es, this number changes to about 10,000.
> 
> I am not making this up! It seems as more queries are made, these estimates become more accurate for a particular geographic server. I have a strong feeling that at some time in future google.com will catch up to google.es and report 10,000 for this particular phrase if we keep issuing queries.<Big Snip>
> 
> Now, I get off my soap box. I’m out of breath.


We're actually aligning the more we talk, Sköll. I think there is something amiss at google.com that isn't a problem on the other servers. But yes -- the ratios are still reliable and that's important. And like JamesM just said the Google hits are also an important tool to prove 'non-common' usage. I am grateful to you that you opened our eyes to the inconsistency of numbers and 'real' count. That's an important piece of this equation.

It seems like you're more accepting of the idea that we shouldn't 'ban' them as your title of this thread demands. 

But I am 'frustrated' as you are that the count seems to be 'padded' at America's server. Where it's more accurate elsewhere. As I said - this is a 'new' development. I've personally been on page 211 before and have seen hit 12,300-12,350 of 4,010,000 hits. So something is 'broken'. :-(

Chao,
Grant


----------



## cuchuflete

While examining the joys of one search engine, we can marvel at the mysteries of some others.  I think the cautionary tale is not limited to one of these:



> Results 1 - 10 of about 3,270 for "It's been years that the underwater monster".  google page 1
> Results 91 - 100 of about 792 for "It's been years that the underwater monster". (0.94 seconds)
> google page 10
> Results 151 - 157 of 157 for "It's been years that the underwater monster". (0.47 seconds)
> google page 18 (last page number listed)
> 
> 
> Yahoo
> 1 - 10 of 5,010 for "It's been years that the underwater monster" (About) - 0.41 s  page 1
> 
> 101 - 110 of 4,860 for "It's been years that the underwater monster" (About) - 0.86 s   page 11
> 
> 351 - 360 of 4,100 for "It's been years that the underwater monster" (About) - 0.72 s  page 36
> 
> 581 - 583 of 4,100 for "It's been years that the underwater monster" (About) - 0.82 s   page 59, last page listed
> 
> 
> Ask.com
> 
> Showing 1-10 of 617 for
> "It's been years that the underwater monster"   page 1
> 
> final page, 13:
> 
> Showing 131-133 of 133 for
> "It's been years that the underwater monster"


----------



## Sköll

To add to the mystery, I get an entirely different result from server www.google.com (connected from an IP in Spain) on June 24, 2009 4:08 PM GMT+01. 

  Results *81* - *85* of *85* for *"It's been years that the underwater monster"*. (*0.17* seconds) (With omitted result not included, last page listed is page 9.)

   Results *321* - *330* of *330* for *"It's been years that the underwater monster"*. (*0.14* seconds) (With omitted result included, last page listed is page 33.)

  And the number of hits for "qué tan bueno es" is back to 168,000!


----------



## Cagey

Here is a closely related thread, though it deals with only one search engine: Using Google "counts" as authority. On it, there is a link to a discussion at Language Log of Google counts, which you may find interesting.  The linguists, who use search engines in their research, have gotten odd and contradictory results. In the discussion they say something about how the search engine works, but their conclusion is that the explanation lies in the formula that Google uses to _estimate_ the possible number of hits out there, should Google search the whole web, which it doesn't.   That formula is a trade secret.  Also, it changes from time to time as Google improves it, so will give different results.

As a side note: No matter how many hits Google finds, it never gives more than 999 actual results.  That had been my experience when clicking through to the end, but I was pleased to see it verified somewhere, perhaps even in a Google site.  Unfortunately I don't remember where that was.  This means, however, that you can't use the actual results to compare the relationship of two usages when both get very large numbers.  The actual sites that Google links to will be about 900 in both cases.  

(I, too, would like to know a short cut for getting to the last citation.)


----------



## NewdestinyX

Cagey said:


> As a side note: No matter how many hits Google finds, it never gives more than 999 actual results.  That had been my experience when clicking through to the end, but I was pleased to see it verified somewhere, perhaps even in a Google site.  Unfortunately I don't remember where that was.  This means, however, that you can't use the actual results to compare the relationship of two usages when both get very large numbers.  The actual sites that Google links to will be about 900 in both cases.


Sorry to disagree. That has not been my experience. With large results the ratios are still preserved. This particular search we've been doing with 'qué tan bueno es' shows that. Even though the orig numbers are 167,000 versus 15,000 the smaller 'actual' pages come up with the same ratio when you search the last page. I did this search with a phrase that comes up with millions of hits and ratios were still preserved. So it's still accurate for comparison purposes. 

Grant


----------



## Sköll

Cagey said:


> (I, too, would like to know a short cut for getting to the last citation.)


   You can try a link of the form: http://www.google.com/#q="hello+there"&start=120

Put this in your browser’s URL box (not the Google search box). That takes you to the 121st result if it exists in the database. But if you set this number too large, typically over 1000, it will simply tell you that no results were found. I think if you would like to go to the last page, which is typically around page 100 when the result set is large, then 990 or so should do the trick.
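If you want to build such links for several phrases, a minimal sketch follows. It assumes the URL form quoted above (the `q` and `start` parameters after `#`) and that percent-encoding the quotation marks as `%22` is treated the same as typing them; the function name `results_url` is made up for this example, and nothing here checks whether Google still accepts this form.

```python
from urllib.parse import quote_plus


def results_url(phrase: str, start: int) -> str:
    """Build a link of the form quoted above.

    phrase is wrapped in double quotes for an exact-phrase search;
    start is the zero-based offset of the first result to show.
    """
    query = quote_plus(f'"{phrase}"')  # spaces become '+', quotes become %22
    return f"http://www.google.com/#q={query}&start={start}"


print(results_url("hello there", 120))
# http://www.google.com/#q=%22hello+there%22&start=120
print(results_url("hello there", 990))
```

The second call uses the offset of 990 suggested above for jumping close to the last available page.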


----------



## NewdestinyX

Sköll said:


> You can try a link of the form: http://www.google.com/#q="hello+there"&start=120
> 
> Put this in your browser’s URL box (not the Google search box). That takes you to the 121st result if it exists in the database. But if you set this number too large, typically over 1000, it will simply tell you that no results were found. I think if you would like to go to the last page, which is typically around page 100 when the result set is large, then 990 or so should do the trick.


Great shortcut, Sköll!!


----------



## Cagey

NewdestinyX said:


> Sorry to disagree. That has not been my experience. With large results the ratios are still preserved. This particular search we've been doing with 'qué tan bueno es' shows that. Even though the orig numbers are 167,000 versus 15,000 the smaller 'actual' pages come up with the same ratio when you search the last page. I did this search with a phrase that comes up with millions of hits and ratios were still preserved. So it's still accurate for comparison purposes.
> 
> Grant



For "_there are no_" I get 349,000,000 claimed results and 893 actual results.  

For "_hello there_" I get 12,500,000 claimed results, and 835 actual results.

Note: Google will redo the above searches when you hit the link, so the numbers you see may be different.

_And thank you very much for the trick, Sköll. It worked!_
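
As a rough check on whether ratios are preserved, the arithmetic for the two searches above can be sketched in Python. The figures are the ones quoted in this post; the dictionary layout and the helper name `ratio` are just for illustration.

```python
# Figures quoted in the post above; Google may report different ones today.
counts = {
    "there are no": {"claimed": 349_000_000, "actual": 893},
    "hello there": {"claimed": 12_500_000, "actual": 835},
}


def ratio(first: str, second: str, key: str) -> float:
    """Return how many times larger the count for `first` is than for `second`."""
    return counts[first][key] / counts[second][key]


claimed_ratio = ratio("there are no", "hello there", "claimed")
actual_ratio = ratio("there are no", "hello there", "actual")

print(f"claimed: {claimed_ratio:.2f} to 1")
print(f"actual:  {actual_ratio:.2f} to 1")
```

With these figures the claimed totals differ by a factor of roughly 28, while the counts reached on the last page are nearly equal. Both last-page counts sit close to the limit of about 1000 results mentioned earlier, so they may say little about true frequency either.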


----------



## NewdestinyX

Cagey said:


> For "_there are no_" I get 349,000,000 claimed results and 893 actual results.
> 
> For "_hello there_" I get 12,500,000 claimed results, and 835 actual results.
> 
> Note: Google will redo the above searches when you hit the link, so the numbers you see may be different.
> 
> _And thank you very much for the trick, Sköll. It worked!_


Not to be too nitpicky here, but after all I've read about Google's famous search engine and its algorithm for determining hits, I think it's more accurate to say 12,500,000 'projected total hits' and 835 'sampled results'. To say 'claimed' and 'actual' assumes that 835 hits on the whole Internet is the more accurate figure, which really seems like a silly notion to me. If Google, for bandwidth's sake, allowed a display of all hits, I confidently believe the number would be much closer to 12,500,000 than to 835. Is anyone disagreeing with me there?

Just because they don't let us sample 'all' hits - doesn't mean their search engine is 'making up bogus hits' as a padding. I don't think anyone's implying that here. Are we?
Yes, I believe that they're 'projecting' the rest -- and that's a 'fallible' undertaking -- but that doesn't make it untrustworthy for determining ratios for comparison and commonness when comparing a total result of 300 to 12,000,000.

Grant


----------



## Cagey

I will accept the changes in terminology.  

These were just the words that occurred to me, but yours are much more precise, for the reasons you say.  (Though it's possible that I have less confidence in the projection than you do.)


----------



## NewdestinyX

Cagey said:


> I will accept the changes in terminology.
> 
> These were just the words that occurred to me, but yours are much more precise, for the reasons you say.  (Though it's possible that I have less confidence in the projection than you do.)


Fair enough..


----------



## Cagey

Oh dear!  I don't want to be disruptive, but ....

On second thought, I don't think I want to describe the citations that appear as "samples". This would imply that I believed they were drawn from a large number of sites actually found. I don't think that when the projected results are 600, but 12 citations are given, that those twelve are drawn from a larger pool. In fact, in cases like this, the numbers of the projected results usually decrease as you click through to the last page, until the two numbers are the same.

I will settle for "projected results" and "citations [presented]" as more neutral terms.


----------



## NewdestinyX

Cagey said:


> Oh dear!  I don't want to be disruptive, but ....
> 
> On second thought, I don't think I want to describe the citations that appear as "samples". This would imply that I believed they were drawn from a large number of sites actually found. I don't think that when the projected results are 600, but 12 citations are given, that those twelve are drawn from a larger pool. In fact, in cases like this, the numbers of the projected results usually decrease as you click through to the last page, until the two numbers are the same.
> 
> I will settle for "projected results" and "citations [presented]" as more neutral terms.


Well you've confused me a bit -- though your terms are still fine with me. I do believe and can prove that the 'cited' actual hits are from real websites and forums where the terms appear. How? Because I've followed hundreds of them and can show that they are all 'real' hits to real sites. The 'projected' extra hits are not  necessarily verifiable on a one to one basis but are exemplary as a projection of many real sites that Google found in the development of their engine when the internet was smaller. And I do believe you could find 90% more of their projected hits as being 'real websites'. 

So maybe we agree there and maybe we don't. But I'm glad there's no move by the management of the foro to ban these results for language study.

Thanks,
Grant


----------



## danielfranco

I propose that all people who quote Google hits as an indicator of anything at all other than what it actually is, a counter, to be banned and all their comments be jettisoned from the site, so that the only mark of their existence would be a cached page in Google.

No, but seriously, I think that using the Google hit counter as indication of anything at all is an instance of faulty logic. A leads to B. B could lead to C. Therefore, A equals C. Huh? Since when? Anyway. No, don't ban nothing and nobody. I would recommend that those who find attribution to their opinions in more reliable sources take the time to mock and deride those who only use Google as their language guru.

Okay, fine, that's unacceptable, too…

Let's use language reference sources to deal with language stuff, and search engines to look for them, how's that?

No?

Hmm…
[vexed]

D


----------



## Grefsen

Sköll said:


> I also agree that when the result is at the other extreme (only a few hits), that is a very good indication that such usage is not common. Arguably, that information can be given by any native speaker rather than performing a comparative study. But I have seen instances where, shall we say, a speaker is confused.


Thanks for starting this very interesting and informative thread *Sköll.*  Great timing too!  I haven't posted to this forum in over six months and came here today specifically just to discuss this topic. 

Most of what I wanted to write about has already been discussed quite well by many of the others.  However, I did want to add that one of the best uses I have found for the Google results wrt language phrases has been for the results "at the other extreme (only a few hits)."  

If I'm not sure about using a phrase, I will often do a Google search first to find out how often this phrase has been used or to learn whether or not the phrase has in fact ever been used before.  If I end up with "No results found" or even just a few hits, then at least I know with a fairly high degree of certainty that the phrase as I have written it is probably incorrect.


----------



## cuchuflete

Grefsen said:


> If I end up with "No results found" or even just a few hits, then at least I know with a fairly high degree of certainty that the phrase as I have written it is probably incorrect.


Or...highly original, creative, artistic... 
Just imagine if Billy Shakespeare, creator of so many wonderful neologisms, had been constrained by search engine hits.


----------



## Grefsen

cuchuflete said:


> Or...highly original, creative, artistic...
> Just imagine if Billy Shakespeare, creator of so many wonderful neologisms, had been constrained by search engine hits.


This is a good point you make *cuchuflete*.  

In fact, this was one of the reasons why I wrote "I know with a *fairly high degree of certainty* that the phrase as I have written it is *probably incorrect*," instead of simply "it has to be wrong."     It is always good to allow for the possibility that I may have actually managed to come up with something "highly original, creative, artistic..."


----------



## Wilma_Sweden

Personally, I don't trust google to tell me how to use a particular word or phrase in English. I use the two major corpora* for that purpose. That way I can at least be sure that the samples cited are produced by native speakers, and the results are categorised according to register, such as spoken language, newspapers, academic, etc. And last but not least, you can mix parts of speech with words in your searches, e.g. hit as a noun. 

Sometimes you can use google to find obscure words even if you're not sure how to spell them as you get a suggested word if you type in a misspelled one. It doesn't always work though - you get 7,280,000 hits for cardiothora*s*ic and 7,290,000 hits for cardiothora*c*ic, and as far as I'm aware, only the second spelling is correct. So much for Google statistics... 

* Some of the major corpora like the BNC (=British National Corpus) or the COCA (Corpus of Contemporary American English) can be accessed from this site: http://corpus.byu.edu/

/Wilma


----------



## NewdestinyX

Wilma_Sweden said:


> Personally, I don't trust google to tell me how to use a particular word or phrase in English. I use the two major corpora* for that purpose. That way I can at least be sure that the samples cited are produced by native speakers, and the results are categorised according to register, such as spoken language, newspapers, academic, etc. And last but not least, you can mix parts of speech with words in your searches, e.g. hit as a noun.
> 
> Sometimes you can use google to find obscure words even if you're not sure how to spell them as you get a suggested word if you type in a misspelled one. It doesn't always work though - you get 7,280,000 hits for cardiothora*s*ic and 7,290,000 hits for cardiothora*c*ic, and as far as I'm aware, only the second spelling is correct. So much for Google statistics...
> 
> * Some of the major corpora like the BNC (=British National Corpus) or the COCA (Corpus of Contemporary American English) can be accessed from this site: http://corpus.byu.edu/
> 
> /Wilma


I love the corpora. But often they're unwieldy to learn how to use. And there are many filters in Google that people simply don't take the time to learn to use. You can get as accurate a result as you wish in Google.. it just takes some time to learn.. I guess as much time as it would take to learn the corpora. Plus on Google you can search multiple languages. Not all languages have corpora online. Like the only Spanish one online is for Peninsular Spanish only.

Grant


----------



## Sköll

Wilma_Sweden said:


> you get 7,280,000 hits for cardiothora*s*ic and 7,290,000 hits for cardiothora*c*ic, and as far as I'm aware, only the second spelling is correct.


That's another example that clearly shows Google totals are essentially meaningless. I simply don't believe there are over 7,000,000 instances of "cardiothora*s*ic" in Google databases. As has been pointed out repeatedly, totals reported by Google do not represent any actual counts; they are simply a best-effort estimate that can be off by a factor of thousands. Not only are these numbers unreliable, they are often not even reproducible: 7,000,000 can easily change to 1000 the next time you ask for the same count. Sometimes, you can get a more believable count by going to the last page; other times, as is the case here, you cannot (it lists about 700 occurrences but insists on 7,000,000 hits).

As for "cardiothora*s*ic", if you search for the same word with quotation marks, you get about 4,500 hits. What 7,000,000 and 4,500 represent is unknown. This is not true for "other" corpora where the actual number of occurrences in the database is reported correctly---with the word "other" being used very loosely since it implies that Google was intended to be used as such a corpus; it is not.

Edit: For the correct spelling of the word, you can consult a dictionary:
thoracic: of, relating to, located within, or involving the thorax (MWD)
Or even use Google by typing: "define: cardiothoracic"


----------



## NewdestinyX

Sköll said:


> That's another example that clearly shows Google totals are essentially meaningless. I simply don't believe there are over 7,000,000 instances of "cardiothora*s*ic" in Google databases. As has been pointed out repeatedly, totals reported by Google do not represent any actual counts; they are simply a best-effort estimate that can be off by a factor of thousands. Not only are these numbers unreliable, they are often not even reproducible: 7,000,000 can easily change to 1000 the next time you ask for the same count. Sometimes, you can get a more believable count by going to the last page; other times, as is the case here, you cannot (it lists about 700 occurrences but insists on 7,000,000 hits).
> 
> As for "cardiothora*s*ic", if you search for the same word with quotation marks, you get about 4,500 hits. What 7,000,000 and 4,500 represent is unknown. This is not true for "other" corpora where the actual number of occurrences in the database is reported correctly---with the word "other" being used very loosely since it implies that Google was intended to be used as such a corpus; it is not.
> 
> Edit: For the correct spelling of the word, you can consult a dictionary:
> thoracic: of, relating to, located within, or involving the thorax (MWD)
> Or even use Google by typing: "define: cardiothoracic"


Well, first and foremost -- you 'never' do a Google search without the 'quotes'. Secondly, I do believe their 'guess' about what's on the 'whole' internet is more accurate than your term, 'meaningless', suggests. What I'm willing to accept is that searches that come up with no more 'projected' results than their pages display are more reliable for comparison than the ones that return millions of hits. It's also very reliable if you find results of 2,500,000 as compared with 750 projected hits. You can be sure that the phrase with 2,500,000 hits is way more reliable than the one with 750.

I'm not willing to allow the final word of this thread to be that you can essentially throw out Google searches as complete nonsense. That couldn't be further from the reality. 

Grant


----------



## Sköll

NewdestinyX said:


> I'm not willing to allow ...


There is no reason to invent new terms; these numbers are simply estimates, and they can be off by several orders of magnitude. There is a good reason why Google uses a heuristic estimate rather than an actual count---it has to do with the time it takes to execute a query. Are you saying that you believe the following "data" is accurate by any measure?

  Cardiothora*s*ic: 7,280,000 hits 
  Cardiothora*c*ic: 7,290,000 hits

Edit: For comparison, Yahoo reports

  Cardiothora*s*ic: 13,100 hits
Cardiothora*c*ic: 5,590,000 hits

Also an estimate.
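
To put a number on "orders of magnitude", here is a small Python sketch using only the four hit counts quoted above. The helper name `orders_of_magnitude` is made up for illustration, and the counts are the estimates as reported at the time, not verified totals.

```python
import math

# Hit counts quoted in the post above.
hits = {
    "Google": {"cardiothorasic": 7_280_000, "cardiothoracic": 7_290_000},
    "Yahoo": {"cardiothorasic": 13_100, "cardiothoracic": 5_590_000},
}


def orders_of_magnitude(a: int, b: int) -> float:
    """Return log10(a / b): how many powers of ten separate two counts."""
    return math.log10(a / b)


# Within each engine: correct spelling relative to the misspelling.
spelling_ratio = {
    engine: h["cardiothoracic"] / h["cardiothorasic"] for engine, h in hits.items()
}
for engine, r in spelling_ratio.items():
    print(f"{engine}: correct/misspelled = {r:.1f} to 1")

# Across engines: how far apart the two estimates for the misspelling are.
gap = orders_of_magnitude(
    hits["Google"]["cardiothorasic"], hits["Yahoo"]["cardiothorasic"]
)
print(f"Google vs Yahoo on the misspelling: {gap:.1f} orders of magnitude")
```

On these figures the two engines are within a factor of about 1.3 of each other for the correct spelling, but they disagree about the misspelling by nearly three orders of magnitude.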


----------



## NewdestinyX

Sköll said:


> There is no reason to invent new terms; these numbers are simply estimates, and they can be off by several orders of magnitude. There is a good reason why Google uses a heuristic estimate rather than an actual count---it has to do with the time it takes to execute a query. Are you saying that you believe the following "data" is accurate by any measure?
> 
> Cardiothora*s*ic: 7,280,000 hits
> Cardiothora*c*ic: 7,290,000 hits
> 
> Edit: For comparison, Yahoo reports
> 
> Cardiothora*s*ic: 13,100 hits
> Cardiothora*c*ic: 5,590,000 hits
> 
> Also an estimate.


Google's is way more accurate a 'guess'. Yahoo's algorithm is a toy compared to Google's. Now the problem is -- that Google searches so many more 'types' of documents -- that's why you're getting so many more hits for the wrong spelling. Google searches forums and ads and even 'twitter' feeds now. To tell you the truth I think that's what's 'watering' down Google's results these days. As I said at the outset - I've been with Google since the earliest of days and using it for language. And as I said several posts ago. I've personally been on page 600 at result# 12,425-12,450 of 2,500,000 hits. I don't know when Google stopped allowing you to view 'all hits'. Either something's broken or they decided it took too much bandwidth. 

For language learning we're not dealing with spelling examples like you're giving there. So it's a 'straw man' argument. We're not looking to spell things correctly when we cite Google on a language forum. If you go back to the original search.. from the other thread.. "cómo de bueno es" versus "qué tan bueno es" -- Google accomplished the task for us beautifully by showing that the 'qué tan..' instances were about, what?, 10:1 compared to "cómo de.." which represents the difference in Latin American speakers to Spain speakers. That's the kind of result where Google is still telling us the 'whole story' - sufficient for language study.

Grant


----------



## Loob

Sköll said:


> That's another example that clearly shows Google totals are essentially meaningless. I simply don't believe there are over 7,000,000 instances of "cardiothora*s*ic" in Google databases.


Actually, if you look at the individual hits for _cardiothorasic_ (without quotes), many of them are actually instances of _cardiothoracic._

I assume this is Google 'doing its job properly' and helping searchers find instances of correctly-spelt words as well as incorrectly-spelt ones.

I really don't disagree with your premise that google totals should be used with caution. But I really do disagree with the idea they should be banned.

I expect I said the same thing in my previous post; I'm always repeating myself.


----------



## Sköll

Loob said:


> Actually, if you look at the individual hits for _cardiothorasic_ (without quotes), many of them are actually instances of _cardiothoracic._
> 
> I assume this is Google 'doing its job properly' and helping searchers find instances of correctly-spelt words as well as incorrectly-spelt ones.
> 
> I really don't disagree with your premise that google totals should be used with caution. But I really do disagree with the idea they should be banned.
> 
> I expect I said the same thing in my previous post; I'm always repeating myself.


    Thank you for pointing this out. I did not examine individual hits. And you are also right that we seem to be repeating ourselves. 

  I don’t know if banning these numbers is realistic. However, judging by the length of this thread, it should be obvious that this is a problem that cannot be addressed in individual threads.



NewdestinyX said:


> .. from the other thread.. "cómo de bueno es" versus "qué tan bueno es" -- Google accomplished the task for us beautifully by showing that [...]10:1


I’m curious to know how you got this. I get about 400:1 using the number of hits on the last page.


----------



## NewdestinyX

Sköll said:


> I’m curious to know how you got this. I get about 400:1 using the number of hits on the last page.


Fine.. I didn't take the time to recheck the hits and redo the math. But there are tens of millions of Latin American speakers compared to several million Spaniards.. so that ratio still proves the point beautifully. And that's really my last word on the subject. Google told us the truth about that usage comparison. And that's why it's an 'essential' language study tool.

Chao,
Grant


----------



## Sköll

But how are you using Google totals to arrive at 10:1?


----------



## Nunty

It's been fun, friends, but the Comments & Suggestions forum of WordReference Forums is for comments and suggestions about... well... WordReference Forums. Since this thread has taken a firm turn into other territory, I am closing it now. 

Interested parties can certainly carry on by private message.

Thank you.

Nunty, moderator


----------

