Wednesday, March 05, 2008 at 6:13 PM
Name/User-Agent: Googlebot
IP Address: Verify it here
Looking For: Websites with unique and compelling content
Major Turn Off: Violations of the Webmaster Guidelines
I know, it's never good to over-analyze a first date. We're going to get to know Googlebot a bit more slowly, in a series of posts:
- Our first date (tonight!): Headers Googlebot sends, file formats he "notices," whether it's better to compress data
- Judging his response: Response codes (301s, 302s), how he handles redirects and If-Modified-Since
- Next steps: Following links, having him crawl faster or slower (so he doesn't come on too strong)
***************
Googlebot: ACK
Website: Googlebot, you're here!
Googlebot: I am.
GET / HTTP/1.1
Host: example.com
Connection: Keep-alive
Accept: */*
From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Accept-Encoding: gzip,deflate
Website: Those headers are so flashy! Would you crawl with the same headers if my site were in the U.S., Asia or Europe? Do you ever use different headers?
Googlebot: My headers are typically consistent world-wide. I'm trying to see what a page looks like for the default language and settings for the site. Sometimes the User-Agent is different, for instance AdSense fetches use "Mediapartners-Google":
User-Agent: Mediapartners-Google
Or for image search:
User-Agent: Googlebot-Image/1.0
Wireless fetches often have carrier-specific user agents, whereas Google Reader RSS fetches include extra info such as number of subscribers.
I usually avoid cookies (so no "Cookie:" header) since I don't want the content affected too much by session-specific info. And, if a server uses a session id in a dynamic URL rather than a cookie, I can usually figure this out, so that I don't end up crawling your same page a million times with a million different session ids.
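As an aside for webmasters: the session-id problem above can be sketched in a few lines of Python. This is only an illustration, not how Googlebot actually works. The parameter names in SESSION_PARAMS and the helper strip_session_ids are hypothetical and chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that often carry session ids (an assumption,
# not a list taken from Googlebot).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_ids(url: str) -> str:
    """Return the URL with query parameters that look like session ids removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same page, each with a different session id.
urls = [
    "http://www.example.com/page?id=7&sid=abc123",
    "http://www.example.com/page?id=7&sid=zzz999",
]
unique = {strip_session_ids(u) for u in urls}
print(unique)
```

Both URLs collapse to the same address once the session parameter is dropped, so the page only needs to be fetched once.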
Website: I'm very complex. I have many file types. Your headers say "Accept: */*". Do you index all URLs or are certain file extensions automatically filtered?
Googlebot: That depends on what I'm looking for.
If I'm indexing for regular web search, and I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I will treat it differently than an HTML or PDF link. For instance, JPG is much less likely to change frequently than HTML, so I will check the JPG for changes less often to save bandwidth. Meanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the PDF article than the JPG file. Downloading doodles (like JPGs) and videos of skateboarding dogs is distracting for a scholar—do you agree?
Website: Yes, they can be distracting. I'm in awe of your dedication. I love doodles (JPGs) and find them hard to resist.
Googlebot: Me, too; I'm not always so scholarly. When I crawl for image search, I'm very interested in JPGs. And for news, I'm mostly looking at HTML and nearby images.
There are also plenty of extensions (exe, dll, zip, dmg...) that tend to be big and less useful for a search engine.
Website: If you saw my URL, http://www.example.com/page1.LOL111, would you (whimper whimper) reject it just because it contains an unknown file extension?
Googlebot: Website, let me give a bit more background. After actually downloading a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it's a special data type like a PDF file, Word document, or Excel spreadsheet, I'll make sure it's in the valid format and extract the text content. Maybe it has a virus; you never know. If the document or data type is really garbled, there's usually not much to do besides discard the content.
So, if I'm crawling http://www.example.com/page1.LOL111 with an unknown file extension, it's likely that I would start to download it. If I can't figure out the content type from the header, or it's a format that we don't index (e.g. mp3), then it'll be put aside. Otherwise, we proceed with indexing the file.
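The decision described above could be sketched roughly as follows in Python. The set INDEXABLE_TYPES and the function triage are made up for this example. They are assumptions, not Google's real list or logic.

```python
# Hypothetical set of media types an indexer might handle (an assumption for
# illustration only).
INDEXABLE_TYPES = {
    "text/html",
    "text/plain",
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
}


def triage(headers: dict) -> str:
    """Decide what to do with a downloaded file based on its Content-Type.

    Assumes the header name is spelled exactly "Content-Type" in the dict.
    The file extension in the URL is never consulted.
    """
    raw = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = raw.split(";", 1)[0].strip().lower()
    if not media_type:
        return "set aside"  # no usable content type in the header
    if media_type in INDEXABLE_TYPES:
        return "index"
    return "set aside"  # a format that is not indexed, e.g. audio/mpeg


print(triage({"Content-Type": "text/html; charset=UTF-8"}))
print(triage({"Content-Type": "audio/mpeg"}))
print(triage({}))
```

Under these assumptions, a URL like page1.LOL111 served as text/html would be indexed like any other HTML page.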
Website: My apologies for scrutinizing your style, Googlebot, but I noticed your Accept-Encoding headers say:
Accept-Encoding: gzip,deflate
Can you explain these headers to me?
Googlebot: Sure. All major search engines and web browsers support gzip compression for content to save bandwidth. Other entries that you might see here include "x-gzip" (the same as "gzip"), "deflate" (which we also support), and "identity" (none).
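As a rough illustration of how a server might read this header, here is a small Python sketch. The function choose_encoding is hypothetical and deliberately simplified: it ignores the "*" wildcard and only uses q-values to drop encodings marked q=0.

```python
def choose_encoding(accept_encoding: str, supported=("gzip", "deflate")) -> str:
    """Pick a content encoding the client accepts, preferring gzip.

    Simplified sketch: "x-gzip" is treated as "gzip", q-values are used only
    to exclude encodings with q=0, and the "*" wildcard is not handled.
    """
    offered = set()
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if token == "x-gzip":
            token = "gzip"
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unparsable q-value: treat as not acceptable
        if token and quality > 0:
            offered.add(token)
    for encoding in supported:
        if encoding in offered:
            return encoding
    return "identity"  # send the body uncompressed


print(choose_encoding("gzip,deflate"))
print(choose_encoding("deflate"))
print(choose_encoding(""))
```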
Website: Can you talk more about file compression and "Accept-Encoding: gzip,deflate"? Many of my URLs consist of big Flash files and stunning images, not just HTML. Would it help you to crawl faster if I compressed my larger files?
Googlebot: There's not a simple answer to this question. First of all, many file formats, such as swf (Flash), jpg, png, gif, and pdf are already compressed (there are also specialized Flash optimizers).
Website: Perhaps I've been compressing my Flash files and I didn't even know? I'm obviously very efficient.
Googlebot: Both Apache and IIS have options to enable gzip and deflate compression, though there's a CPU cost involved for the bandwidth saved. Typically, it's only enabled for easily compressible text HTML/CSS/PHP content. And it only gets used if the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". Gzip is a slightly more robust encoding — there is consistently a checksum and a full header, giving me less guess-work than with deflate. Otherwise they're very similar compression algorithms.
If you have some spare CPU on your servers, it might be worth experimenting with compression (links: Apache, IIS). But, if you're serving dynamic content and your servers are already heavily CPU loaded, you might want to hold off.
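The gzip and deflate comparison above can be tried out with Python's standard library. This is a sketch under stated assumptions, not a description of Googlebot's decoder. One common source of guess-work (my reading, not something the post spells out) is that HTTP "deflate" is specified as a zlib-wrapped stream, yet some servers send a bare deflate stream instead. The helper inflate_deflate is a hypothetical name for a decoder that tries both.

```python
import gzip
import zlib

html = b"<p>Hello, Googlebot!</p>" * 200  # repetitive text compresses well

gzip_body = gzip.compress(html)  # gzip wrapper: full header plus CRC-32 trailer
zlib_body = zlib.compress(html)  # zlib wrapper, which HTTP "deflate" is specified to use
raw = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
raw_body = raw.compress(html) + raw.flush()  # bare deflate: no header, no checksum


def inflate_deflate(body: bytes) -> bytes:
    """Decode a "deflate" body, guessing between zlib-wrapped and bare streams."""
    try:
        return zlib.decompress(body)
    except zlib.error:
        return zlib.decompress(body, -15)  # negative wbits means a bare stream


print(len(html), len(gzip_body), len(zlib_body), len(raw_body))
print(gzip_body[:2] == b"\x1f\x8b")  # gzip streams start with a fixed magic number

# Flip one byte of the CRC-32 in the gzip trailer; decoding should now fail.
damaged = bytearray(gzip_body)
damaged[-8] ^= 0xFF
try:
    gzip.decompress(bytes(damaged))
    crc_detected = False
except OSError:
    crc_detected = True
print(crc_detected)
```

The gzip stream identifies itself and carries a checksum, so damage is detected. With "deflate" the receiver may first have to work out which variant it was sent.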
Website: Great information. I'm really glad you came tonight — thank goodness my robots.txt allowed it. That file can be like an over-protective parent!
Googlebot: Ah yes; meeting the parents, the robots.txt. I've met plenty of crazy ones. Some are really just HTML error pages rather than valid robots.txt. Some have infinite redirects all over the place, maybe to totally unrelated sites, while others are just huge and have thousands of different URLs listed individually. Here's one unfortunate pattern. The site is normally eager for me to crawl:
User-Agent: *
Allow: /
Then, during a peak time with high user traffic, the site switches the robots.txt to something restrictive:
# Can you go away for a while? I'll let you back
# again in the future. Really, I promise!
User-Agent: *
Disallow: /
The problem with the above robots.txt file-swapping is that once I see the restrictive robots.txt, I may have to start throwing away content I've already crawled in the index. And then I have to recrawl a lot of content once I'm allowed to hit the site again. At least a 503 response code would've been temporary.
I typically only re-check robots.txt once a day (otherwise on many virtual hosting sites, I'd be spending a large fraction of my fetches just getting robots.txt, and no date wants to "meet the parents" that often). For webmasters, trying to control crawl rate through robots.txt swapping usually backfires. It's better to set the rate to "slower" in Webmaster Tools.
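To see why the swap is so drastic, here is a small Python sketch using the standard library's robots.txt parser. The helper names can_crawl and peak_time_response are hypothetical. The Retry-After header and its one-hour value are my own additions for illustration and do not come from the advice above.

```python
from urllib.robotparser import RobotFileParser

OPEN_ROBOTS = "User-Agent: *\nAllow: /\n"
CLOSED_ROBOTS = "User-Agent: *\nDisallow: /\n"


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Check whether the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def peak_time_response(overloaded: bool):
    """Sketch of the alternative: answer 503 while busy, leave robots.txt alone.

    Returns a (status, headers) pair. Retry-After is a standard HTTP header,
    but using it here, and the one-hour value, are assumptions.
    """
    if overloaded:
        return 503, {"Retry-After": "3600"}
    return 200, {}


print(can_crawl(OPEN_ROBOTS, "http://example.com/page"))
print(can_crawl(CLOSED_ROBOTS, "http://example.com/page"))
print(peak_time_response(True))
```

With the restrictive file, every URL on the site is off limits until the file is fetched again. A 503 only says "not right now".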
Googlebot: Website, thanks for all of your questions, you've been wonderful, but I'm going to have to say "FIN, my love."
Website: Oh, Googlebot... ACK/FIN. :)
***************


26 comments:
Fun post and creative to say the least! By the way, thanks for including "If-Modified-Since". Can't wait till the second date! : )
You say that the googlebot accepts gzip ...
Let's say a website's homepage is 300k sans-gzip before all images, etc -- kinda big. Post-gzip, it is 18k plus images -- much better for the user.
As far as I can tell, the Google-reported "size" is 300k in the SERPs. Why isn't the page listed at 18k provided that gzip is enabled and working properly?
FWIW, this is, obviously, a real world example about a site of mine. The site has been re-indexed a few times since gzip was enabled, so I know that is not the problem.
Thanks for the post. :)
hehehe.. a very creative way on how googlebot reads a site... I can't wait for the second date... Till then.
While I think this has been a fantastic date, Googlebot still reminds me of an untimely boyfriend. You love him but he insists on visiting you when you can only kiss him goodnight! I think that simply telling him to move at a slower pace won't help any of us. If only he could come at another hour!
Oh Googlebot, you're such a dream boat. You're everything I'm looking for in a crawler.....
I'm just afraid that you won't respect me the morning after.....
lol, one of the best posts ever!
Cool stuff. I implemented "If-Modified" already two weeks ago. :)
Looking forward to the second date.
OK, I actually felt Googlebot was a person. I would have been fooled if I didn't know better. Nice post.
I have a question to do with your Media Bot. I am the webmaster of http://exposureroom.com
This bot is driving me crazy because it does not respect the rel="nofollow" attribute on links.
As a result it's attempting to go into secure pages but because the pages are, well secure the server issues an error. It so happens that I get emails for errors on my site.
I can't put an entry for every secured page on the site in robots.txt so what are my options?
I mean I don't mind media bot doing its thing as long as it respects the rel="nofollow"
Hi shiv - the Mediabot is trying to access all pages where you placed AdSense elements to be able to target them best. If you prefer to keep the crawler out of these pages, you need to make sure that they do not contain any of these elements.
John,
Ok, I think I need to explain some more. On the face of it what you say seems logical, however, in reality that might not be the case.
For example we have a page that is open to the public (in fact all of the site is open to the public).
http://exposureroom.com/members/skumar.aspx/
Now as a member, when I've logged in, and I choose to be in edit mode, this same page will have a link that lets me edit the content on this page.
Well the bot tries to go to the page that the link points to (which has a rel="nofollow" attribute set to it).
That's not right. Yes, we have ads on this page (it's the same page really just the appearance changes due to the user choosing to edit the page)
I think we (publishers) need more control, don't you agree?
Hello Google Team,
It's been more than 20 days and Google has not indexed my new blog http://abovetopsecret.blogspot.com/. When I search for site:abovetopsecret.blogspot.com I don't get a single hit; I get the following result: Your search - site:abovetopsecret.blogspot.com - did not match any documents.
Hi guys,
Something nothing to do with this post: Can someone, plz, tell me how to get a site blacklisted. Other way than sending a "Request reconsideration", because it doesn't work for me.
Nice post! But in the "what googlebot sees" page the list of keywords are not the keywords I have listed in my scifi website.
Nice story! But I was thinking that you should follow your slogan "Don't be evil". I mean the googlebot picture looks a bit evil to me :)))
Do we have to write love letters if we wish to ask questions? Well, here goes.
Dear Google bot,
In all my dreams I caress those voluptuous algorithms of yours, indescribable sensation offering such orgasmic search results. A question however keeps haunting me, do you only follow links generated with the "a" element ? What about, for instance, the "param" or "embed" element used for displaying Flash ? I'm wondering if there's any point using the "nofollow" value on those two tags. Whatever your answer may be, I still love you, visit me soon. <3
Robert,
That was really good!
The last bit I would have written something like:
I look forward to your frequent visits so I can feel your every caress :)
Damn. Thanks for this article!
I'm a real dumbshit so I'm glad someone put this in such a straightforward, cute, and funny way.
And this website is getting bookmarked.
Thanks for the useful information.
I can't believe you send an accept header of */*. It is clear that you do much worse with XML than HTML. I'm trying to figure out how to determine whether server side rendering of the XML is needed and you guys do nothing to help with that.
Ahh googlebot. You are really dumbing down gen Y-ers, thanks to your drivel you keep serving up in your searches.
http://justtofun.blogspot.com/2008/05/fast-food-information-ala-google.html
Sometimes an already validated Google Sitemap file in the root directory of a website shows an error in Webmaster Central saying that Googlebot was not able to locate the file. Most of my clients' websites, whose sitemaps we had already validated, showed this error when we logged in to our Google Webmaster Central account after six months.
My recommendation is that Google should keep the state of already validated domains somewhere. If for some reason the bot is not able to get the validation xxx.html file from the root folder of a domain, it can re-check the already validated domains for the .html file.
With Regards,
Mentor.
Dear Friends,
I am building a new blogspot site. http://banglatrends.blogspot.com/ It is already validated. But my site isn't indexed yet. Google search cannot find my site. I have a feedburner account and it shows the site activities result. Atom feed is also okay. Can anyone help me to solve the problem?
ooo googlebot please , dont be afraid to come and visit my site...lol
http://shiitake.freehostia.com
http://www.oyster.0fees.net
Thanks for the info.....!
Euh, the googlebot picture is kinda scary even though it's carrying flowers. Lol.. haha. Hey, googlebot, I wanna ask you about how you determine page rank. Do you mind sharing it with us?
Hi everyone,
Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.
Thanks and take care,
The Webmaster Central Team