Sunday, June 01, 2008 at 11:06 PM
Many of you have asked for more information regarding webserving techniques (especially related to Googlebot), so we made a short glossary of some of the more unusual methods.
- Geolocation: Serving targeted/different content to users based on their location. As a webmaster, you may be able to determine a user's location from preferences you've stored in their cookie, information pertaining to their login, or their IP address. For example, if your site is about baseball, you may use geolocation techniques to highlight the Yankees to your users in New York.
The key is to treat Googlebot as you would a typical user from a similar location, IP range, etc. (i.e. don't treat Googlebot as if it came from its own separate country; that's cloaking).
- IP delivery: Serving targeted/different content to users based on their IP address, often because the IP address provides geographic information. Because IP delivery can be viewed as a specific type of geolocation, similar rules apply. Googlebot should see the same content a typical user from the same IP address would see.
(Author's warning: This 7.5-minute video may cause drowsiness. Even if you're really interested in IP delivery or multi-language sites, it's a bit uneventful.)
- Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. A program such as md5sum (which computes a hash) or diff (which compares the contents directly) can verify that two files are identical.
- First click free: Implementing Google News' First click free policy for your content allows you to include your premium or subscription-based content in Google's websearch index without violating our quality guidelines. You allow all users who find your page using Google search to see the full text of the document, even if they have not registered or subscribed. The user's first click to your content area is free. However, you can block the user with a login or payment request when they click away from that page to another section of your site.
If you're using First click free, the page displayed to users who visit from Google must be identical to the content that is shown to Googlebot.
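
To make the cloaking check above concrete, here is a minimal sketch in Python of the two comparisons the entry names: md5sum computes a hash of each file, while diff compares the contents line by line. The page bodies and the helper names (`md5_hex`, `same_content`, `line_diff`) are made up for illustration, and the standard-library `hashlib` and `difflib` modules stand in for the command-line tools. It is not a description of how Google checks pages.

```python
import difflib
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string (the value md5sum prints)."""
    return hashlib.md5(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """True when both responses hash to the same digest."""
    return md5_hex(a) == md5_hex(b)


def line_diff(a: str, b: str) -> list[str]:
    """Return unified-diff lines between two pages; empty when they are identical."""
    return list(
        difflib.unified_diff(
            a.splitlines(),
            b.splitlines(),
            fromfile="googlebot",
            tofile="user",
            lineterm="",
        )
    )


# Hypothetical page bodies: what the crawler received vs. what a visitor received.
page_for_googlebot = "<h1>Baseball</h1>\n<p>Full article text.</p>\n"
page_for_user = "<h1>Baseball</h1>\n<p>Please log in to read.</p>\n"

if __name__ == "__main__":
    print(same_content(page_for_googlebot.encode(), page_for_user.encode()))
    for line in line_diff(page_for_googlebot, page_for_user):
        print(line)
```

Matching digests mean the two responses are byte-for-byte identical; when they differ, the diff output shows which lines changed.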


29 comments:
I use scripts (WordPress plugins) to offer different content for smartphones (Windows Mobile) and iPhones.
Is this a violation of the webmaster guidelines?
You could have been a news anchor :)
The camera person might want to keep their day job however ;)
Wow, I didn't know about the "first click free" feature. That is very useful.
But why haven't they implemented this for Google Web Search and not just News?
Our site, and many others I can think of, have subscription content which is not news-based and would therefore hugely benefit from this feature being added to the regular Google search.
How does Amazon continue to get away with delivering different content when the user agent is googlebot?
"The key is to treat Googlebot as you would a typical user from a similar location, IP range, etc. (i.e. don't treat Googlebot as if it came from its own separate country—that's cloaking)."
Let's say I wanted to redirect US traffic from a .co.uk site to the .com and vice versa based on the user's IP location. Is this article saying it would be OK to do so and not considered cloaking?
I have been wondering for the longest time how Google allows the New York Times to violate the cloaking rules in the webmaster guidelines. How long has this first click free thing been around? Nice way to get high-profile news sites in your index while keeping within your guidelines, btw :P
This is tricky for us. We're a job board that shows different jobs depending on the IP. But our geotargeting code always says that Googlebot is a US host, so it never gets to see the jobs from different countries.
How can I tell if it's a 'UK' Googlebot or a French one or a German one? I wish they'd clarify this.
@mintyco
I would just include a link on the page somewhere for jobs in each of the other countries you offer them in. That way, Google will find them, and if your geotargeting identifies someone incorrectly, they can still find what they are looking for.
If Google is going to use a byte-comparison approach on useragent=googlebot vs. non-bot GETs of the same page, you're headed for a major problem: many, many sites have rotating/random content (such as specials...or AdSense ads :-p), so that even for the same IP, same user agent, etc. two GETs of the same page a second apart from even the same machine will yield different results.
I sure hope the comment about MD5/checksum was casual, and not a real indication of what your engineers are planning.
Thanks for the post and enthusiasm.
The post said this re: CLOAKING:
“Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you’re in a high-risk category. A program such as md5sum or diff can compute a hash to verify that two different files are identical.”
That’s not very much in fact. Not much more than what was up already. And that’s a problem because at the moment the Google definition of cloaking is vague:
1. What about REST? URIs represent different resources. Yet more than one representation of a resource could exist at the same URI. Performing an MD5 on a URI is not good enough.
2. The wording used to describe cloaking is always “different content”. Google should be more proactively specific in this regard. Does Google mean to say that serving the same content represented in different formats/structures (xml/html/plain text/Javascript-enabled) is cloaking? Or does Google mean to say that serving different content (regardless of format or structure) is cloaking?
Implications of the above:
If cloaking truly means serving different content as opposed to the same content with different structure, then webmasters will be able to push client-side implementations (as Google themselves have done) using the fragment-identifier and Javascript, and then serve the same content, albeit in different (plain-text) structure to Google.
The world is clearly going client-side. This needs to start happening!
RE: Author's warning
Don't knock yourself, you're in hardcore search nerd territory :-)
What about countries where people speak several languages?
WHAT happens to ccTLDs?
If I have a .ro domain name, does that mean that I will be able to use it only to target visitors from Romania?
If so, does a .com imply the use of Comish language on a web site, the same way an .us implies the use of en-US language?
(Note: "WHAT" is capitalized just to draw focus.)
I notice that Google itself ignores my preferred browser language settings, displaying AdSense ads in Spanish based on my IP address. Then again, Facebook has the same problem: http://benpowell.blogspot.com/2008/03/facebook-cant-get-their-advertising-to.html
Matt Cutts defined cloaking as showing different content to users and different content to search engines. What if you show a part of your content to users, and that part not at all to search engines?
So does this mean that we can also tell Googlebot which geolocation we would like the bot to come from?
Or maybe Google would like to send its bot to my site from every country that my clients market tailored content to?
We redirect Googlebot to a page that's essentially the same but with a cleaner URL and without some "Did You Mean..." links. We were once told that this would not be considered cloaking. Could this be a problem now? I don't think any human would object to this, but could we get caught in an automated test?
There are a few points about this that trouble me, the main one is to do with the idea of using content hashes to identify different content being served to different user agents.
The problem with a hash is that (by design) the tiniest change in the input completely changes the result. So how would Google tell the difference between a page that was serving different content to different agents and one that was serving different content for each hit?
Given the way hashes work, 'different' here need only be a different 'timestamp', a 'hit counter' field, or a URL augmented to support session tracking (where cookies are not available). In these cases the hashes would be completely different each time the content hash was recalculated, whatever user agent it was served to.
If what is needed is a simple way of quantifying how 'different' two sets of content are (and whether these differences are significant), there are much better ways of tackling this kind of problem than hashing. I may be wrong, but I thought Google already used proper 'similarity' algorithms.
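
As a rough illustration of the point above (a minimal Python sketch, not anything Google is known to use): two pages that differ only in a timestamp produce unrelated MD5 digests, while a similarity ratio stays close to 1. The sample pages are hypothetical, and `difflib.SequenceMatcher` is just a convenient standard-library stand-in for a real similarity algorithm.

```python
import difflib
import hashlib


def digest(text: str) -> str:
    """MD5 hex digest of the page text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] from difflib.SequenceMatcher; 1.0 means identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical pages: the first two differ only in an embedded timestamp.
base = "<p>Job listings for London</p><p>Served at 12:00:01</p>"
later = "<p>Job listings for London</p><p>Served at 12:00:02</p>"
other = "<p>Buy cheap pills now</p>"

if __name__ == "__main__":
    print(digest(base) == digest(later))  # the digests differ
    print(similarity(base, later))        # close to 1
    print(similarity(base, other))        # noticeably lower
```

A hash can only answer "identical or not", whereas a ratio like this lets you pick a threshold for "different enough to matter".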
I disagree with the concept that different content based on the user-agent is necessarily cloaking! For example, I make sure to use the exact same CONTENT for a user visiting most of my pages but often strip certain elements based on user-agent; thus I can keep the absolute address of a document the same no matter what platform is getting it, but leave out irrelevant objects for some browsers (in the case that a visitor is viewing on a smartphone, a document will serve thumbnails instead of large images, some JavaScripts/CSS are stripped, iframes are dropped, etc.; the same document will be essentially the same if viewed on a desktop browser, except presented in a desktop-friendly fashion).
There are also times that a developer may opt to serve different JavaScripts or CSS files based completely on desktop browsers that still do not affect the actual content--a Firefox CSS file vs an IE CSS file, etc. Or loading various CSS and JavaScript based on OS as well!
As long as the end-user is still getting the same information, then Google should not punish a site for changing the exact format of the presentation--that should not be labeled "cloaking".
If Google really wanted to apply the policy “users should always see the same page that Googlebot saw”,
they would not allow websites like
expedia.com
hotels.com
opodo.com
(just to mention some travel sites, there are many others)
to show a completely different homepage (automatic redirection based on IP geolocation) depending on whether you are a bot or not. Bots don’t get any redirection.
In addition, if this were a "risk area", why are they not even checking the IP, or using other techniques to hide this fact?
They all use simple user-agent detection, and even a 301 redirect… is IP geolocation now permanent?!
It would be very interesting to have an answer from some Google guy ... but I guess they never will. Being vague and talking about "hash" techniques (which are useless in this case) is the best way to keep many people off doing this.
If I am wrong, please someone tell me!! :)
OK, after watching the video I tested google.com from Italy and from Germany. I thought that the idea of using Accept-Language to decide whether to redirect or not was a good one...
Well, Google DOESN'T DO IT. No matter what the Accept-Language header was (even an empty one), or the browser language, or any other info sent to the server, it would always redirect to google.de/.it. (And this is also very annoying at times.)
Why use Google as an example when it doesn't even follow its own guidelines?
Wow, Maile is incredibly hawt. Teach me more. This is the first I've heard of ip delivery. Great overview.
Could you please point me/us in the direction of a Google-friendly script that does what your video talks about, one that automatically redirects to a subpage of your website if it detects that the IP is from a region with another language?
Also, if you have a mobile redirect script that is Google friendly, please point that out too. We have several sites that have a #1 position in your engine and want to make sure not to lose them because we installed a non-friendly script.
I found a site that builds links for a .org site. When you try to go to the .org site, it forwards you to a .com, for which links are being built all the time as well.
I am not quite sure whether it is cloaking or not, but links are being built for both sites and the .org site does not exist, so there is benefit from a .org site as well as a .com site.
I've summarized several articles of Google on geo- and language targeting in one (relatively) simple schema: SEO tutorial on geo targeting and language targeting. I hope this will be of help.
Jeez,
I can't believe how many people are being so anal and picking this apart.
The main concept is to not try and cheat the search engine.
example ...
if useragent = googlebot then display google optimized content
else
display different content for a human.
That is the clearly defined violation.
MD5 isn't a viable solution for Google, and they should know this.
Generally, a US Googlebot should see the same thing a US person behind a typical browser sees.
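
A minimal Python sketch of that distinction, with made-up content strings and a hypothetical `Request` type whose `country` field is assumed to come from an IP lookup done elsewhere. The first function picks content from location alone, so a US Googlebot and a US browser get the same page; the second shows the user-agent branch described above as the violation.

```python
from dataclasses import dataclass

# Hypothetical mapping from country code to the content variant shown there.
CONTENT_BY_COUNTRY = {
    "US": "Jobs in the United States",
    "GB": "Jobs in the United Kingdom",
}
DEFAULT_CONTENT = "Jobs worldwide"


@dataclass
class Request:
    country: str      # assumed result of an IP-to-country lookup done elsewhere
    user_agent: str


def select_content(request: Request) -> str:
    """Pick content from location only; the user agent is deliberately ignored."""
    return CONTENT_BY_COUNTRY.get(request.country, DEFAULT_CONTENT)


def select_content_cloaked(request: Request) -> str:
    """Anti-pattern for contrast: branching on the crawler's user agent."""
    if "googlebot" in request.user_agent.lower():
        return "Keyword-stuffed page for crawlers"
    return select_content(request)


if __name__ == "__main__":
    us_user = Request(country="US", user_agent="Mozilla/5.0")
    us_bot = Request(country="US", user_agent="Googlebot/2.1")
    print(select_content(us_user) == select_content(us_bot))
    print(select_content_cloaked(us_user) == select_content_cloaked(us_bot))
```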
Has anyone done what @Jeff suggested in his note?
Build a 'Regions' menu and auto switch the user to their geo-located region. All of the other regions information would be available through the menu.
Stuart
Hi everyone,
Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.
Thanks and take care,
The Webmaster Central Team