Thursday, March 27, 2008 at 1:27 PM
We all know how friendly Googlebot is. And like all benevolent robots, he listens to us and respects our wishes about parts of our site that we don't want crawled. We can just give him a robots.txt file explaining what we want, and he'll happily comply. But what if you're intimidated by the idea of communicating directly with Googlebot? After all, not all of us are fluent in the language of robots.txt. This is why we're pleased to introduce you to your personal robot translator: the Robots.txt Generator in Webmaster Tools. It's designed to give you an easy and interactive way to build a robots.txt file. It can be as simple as entering the files and directories you don't want crawled by any robots.
Or, if you need to, you can create fine-grained rules for specific robots and areas of your site. Once you're finished with the generator, feel free to test the effects of your new robots.txt file with our robots.txt analysis tool. When you're done, just save the generated file to the top level (root) directory of your site, and you're good to go.
There are a couple of important things to keep in mind about robots.txt files:
- Not every search engine will support every extension to robots.txt files
The Robots.txt Generator creates files that Googlebot will understand, and most other major robots will understand them too. But it's possible that some robots won't understand all of the robots.txt features that the generator uses.
- Robots.txt is simply a request
Although it's highly unlikely from a major search engine, there are some unscrupulous robots that may ignore the contents of robots.txt and crawl blocked areas anyway. If you have sensitive content that you need to protect completely, you should put it behind password protection rather than relying on robots.txt.
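To make the simple case concrete, here is a minimal Python sketch of what a generated file might look like and how a well-behaved robot would read it. It is not the Webmaster Tools generator itself: the helper name `build_robots_txt`, the example paths, and the host `www.example.com` are assumptions made for illustration, and the standard library's `urllib.robotparser` stands in for a compliant crawler.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text asking `user_agent` not to crawl the given paths."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Hypothetical files and directories we do not want crawled by any robot.
robots_txt = build_robots_txt(["/private/", "/tmp/draft.html"])

# Parse the generated text the way a compliant robot would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

private_ok = parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html")
index_ok = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
```

Here `private_ok` is `False` and `index_ok` is `True`. Note that the check only describes what a cooperative robot will do; nothing in the file stops a robot that chooses to ignore it.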
We hope this new tool helps you communicate your wishes to Googlebot and other robots that visit your site. If you want to learn more about robots.txt files, check out our Help Center. And if you'd like to discuss robots.txt and robots with other webmasters, visit our Google Webmaster Help Group.



61 comments:
Combined with the robots.txt analysis, that's a nice little tool.
When generating the text for the robots.txt, it would be great if the tool could add in the sitemap declaration also, depending on what sitemaps have been submitted to Google.
Thanks.
I was just about to mention the sitemap inclusion. One way around it would be to add the sitemap manually to the downloaded robots.txt file.
Anyway, a nice improvement.
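The manual workaround mentioned above can be sketched in a few lines of Python. This is only an illustration: `add_sitemap` is a hypothetical helper, the sitemap URL is a placeholder, and `RobotFileParser.site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def add_sitemap(robots_txt, sitemap_url):
    """Append a Sitemap line unless that exact URL is already declared."""
    declared = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    if sitemap_url in declared:
        return robots_txt
    if robots_txt and not robots_txt.endswith("\n"):
        robots_txt += "\n"
    return robots_txt + f"Sitemap: {sitemap_url}\n"


# Stand-in for the file downloaded from the generator.
generated = "User-agent: *\nDisallow: /private/\n"
updated = add_sitemap(generated, "http://www.example.com/sitemap.xml")

# Confirm that a parser sees the declaration (Python 3.8+).
parser = RobotFileParser()
parser.parse(updated.splitlines())
sitemaps = parser.site_maps()
```

Calling `add_sitemap` a second time with the same URL returns the text unchanged, so the helper is safe to run repeatedly.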
Have you guys made changes recently that would account for the widespread reports of people not being able to verify sites in Webmaster Tools?
A quick search shows hundreds (thousands?) of people who, for no apparent reason, can't verify sites that have been verified for years.
Very nice tool.
However, it would be nice if Google provided a link in Webmaster Tools to immediately flush URLs that are blocked by our robots.txt out of the Google index, rather than having to wait weeks or months for them to be removed, or having to remove them one by one or per directory.
Is Googlebot really a "he"? Why can't Googlebot be a "she" or an "it"?
Is there a chance Blogger users might be able to edit their robots.txt files in the future? I'd love to be able to utilize this tool.
BTW, I just started reading Isaac Asimov's "Robot Visions" this morning. Heh.
Is there any way to direct the robot to crawl the images of the site?
Hi folks:
Thanks for the feedback and suggestions; I'll pass them along.
Jason, we're looking into that issue. You can follow the progress here.
Hey Ghosty, what types of changes would you want to make to your robots.txt file? Is there something you're unable to accomplish with the current configuration?
Well, the URLs for labels are considered naughty as far as my sitemap is concerned. I might try to implement them that way.
I'm at work, I'll think of other things tomorrow. I'm sure there's more than one or two uses for Blogger users.
Indeed, this tool is very helpful. I used it to create my robots.txt file: you just download it from Google's tools, save it to your web page, and that's it. Perfect.
Thank you,
G.
http://www.howtoguidehome.com
Have a good weekend, from Maria
@Susan: OK, here's what's on my mind ... currently, Googlebot cannot access URLs linked from labels. If I generate the label "Baltimore", for example, Blogger generates this URL for it on my site:
http://addh.blogspot.com/search/label/Baltimore
Yet Googlebot cannot access this URL for some reason, so it comes up as an error in my webmaster tools.
I'd also use it to limit Googlebot in certain ways, for example, not to index images on my site, or to index some labels, but not others.
The advice to place some content behind a secure login is good, but I find that it must also be combined with the robots.txt file to be effective. Also, just in case some robots do try to index those pages, remember to have custom pages designed to remind users that they must be logged in. Then, even if robots try to access pages behind a login, the pages indexed will simply be login reminders, but still valid pages.
I'm using Blogspot. The last time I analyzed the robots.txt, I found that it is blocking entire pages of my blog! You know, "URLs restricted by robots.txt".
Is there any way I could edit or re-upload the robots.txt for Blogspot?
I have had to write my own robots.txt all this time, but I don't understand why this generator is said to work fine only with Googlebot, leaving uncertainty about other bots. The Google guys are smart people; maybe other bots prefer their own version of robots.txt.
Ghosty (and CS staff),
Thanks for your feedback.
The reason your .../search/label/... pages are blocked from crawling is because all the content available on those pages is also available at other URLs. So your robots.txt file isn't actually preventing any of your content from being indexed; for example, even though http://example.blogspot.com/search/label/Baltimore is blocked by robots.txt, any posts with that label (such as http://addh.blogspot.com/2008/03/post-about-Baltimore.html) are still available to crawlers. Blocking the /search/ directory is designed to reduce duplicate content.
Also, you wouldn't be able to prevent indexing of images from your blog with your robots.txt file, since the file only governs content located on addh.blogspot.com, but your images are hosted on photobucket and/or blogger.com.
I'm trying to figure out whether there's a strong need for Blogger users to be able to edit their robots.txt files, but so far I'm not sure I've found one...
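The behavior described in the reply above can be checked with a short Python sketch. The robots.txt content below is an assumption based on the statement that the /search/ directory is blocked; it is not a copy of the real Blogger file. The two URLs are the ones quoted in the thread.

```python
from urllib.robotparser import RobotFileParser

# ASSUMED content: a single rule blocking everything under /search.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT.splitlines())

# The label listing is blocked from crawling...
label_ok = parser.can_fetch(
    "Googlebot", "http://addh.blogspot.com/search/label/Baltimore"
)
# ...but an individual post carrying that label is still crawlable.
post_ok = parser.can_fetch(
    "Googlebot", "http://addh.blogspot.com/2008/03/post-about-Baltimore.html"
)
```

Under that assumed rule, `label_ok` is `False` and `post_ok` is `True`, which matches the explanation that the posts themselves remain available to crawlers.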
@Susan: Thanks for the feedback. Yes, I understand the need to prevent duplicate content, so this does make sense.
Still would like to have some control over images (Blogger hosted ones) via robots.txt, just like any other site owner would have on their own domain ... not me personally, perhaps, but I'm willing to bet some folks would find it useful. It's all about options, I say. :D Thanks for the communication.
That is a great idea, great help for web masters, thanks for that Google.
I run a small claims court site from Toronto, Ontario, Canada called www.MrSmallClaimsCourt.com. It has been up a few months, and I submitted it to the Google index, but for some reason my pages are only being indexed in Google's web pages section and not in the "pages from Canada" section. Does anybody have an explanation? Thanks very much.
Hi MrSmallClaims,
This article should answer your questions. We generally look at factors like your site's TLD and where your site is hosted, but you can use our geographic targeting tool to let us know if you're trying to target searchers in Canada.
Susan, thanks for your response. However, I just learned that Google is not indexing some of my pages at all, such as the home page, for starters. Check out
http://64.233.183.104/search?sourceid=navclient-ff&ie=UTF-8&q=cache%3Ahttp%3A%2F%2Fwww.mrsmallclaimscourt.com%2F
Any ideas why not? Thanks a ton.
Jennifer Mathews Somogyi, what do you think?
IMHO: Maybe Googlebot is really a "he", but he likes guygoogleboyts (not girls)! ;)
Your site is accessible at both www.mrsmallclaimscourt.com and mrsmallclaimscourt.com (with or without the www), and we're indexing the version without the www:
[site:mrsmallclaimscourt.com]
Read this for more information.
Susan, sorry if I'm asking dumb questions here. Does it matter which one is the preferred domain? Also, it seems like indexing was hitting 9 pages a day in February but has since tailed off. Any ideas why? Further, I added Canada in the geographic targeting tool, so I guess it may take a while to catch on in the "pages from Canada" index?
Thanks a ton for helping me!!!
Since these questions are specific to your site and not relevant to this blog post, let's move the discussion to our Help Group.
I have a Blogspot blog at http://www.sriraminhell.blogspot.com . How do I edit the robots.txt, given that the blog is hosted on Blogspot's server and I don't have access to the file?
How is it possible that Google's robots are not able to detect spamming? Some Albanian pages use overly long titles that repeat the same word in order to spam, for example Argetohu.com, tiranachat.net, etc.
They rank first on Google for the keywords "Muzik Shqip" or "Mp3 shqip", and they are not good pages for these.
Thanks, Google team, for this tool. This is a great help.
I would like to add
User-agent: Googlebot-Image
Disallow: /
to my Blogger robots.txt, since Google indexes pictures (thumbnails) that are long gone from the blog, even though I have removed them the proper way.
I also have another problem: sometimes when I delete pictures I get the message "there was a problem deleting these pictures please contact support", but support is nowhere to be found =). Goodbye!
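For readers wondering what those two lines would do, here is a small Python sketch using the standard library's `urllib.robotparser`. The host and image path are placeholders. As noted earlier in the thread, such a rule only governs files served from the host where the robots.txt lives, so it would not affect images hosted elsewhere.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted in the comment above.
rules = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URL on the same host as the robots.txt file.
image_url = "http://www.example.com/photos/cat.jpg"

image_ok = parser.can_fetch("Googlebot-Image", image_url)  # image crawler
web_ok = parser.can_fetch("Googlebot", image_url)          # web crawler
```

With only this group in the file, `image_ok` is `False` while `web_ok` is `True`: the rule is addressed to the image crawler alone and leaves other robots unrestricted.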
I was running a blog (techsupportforpc.blogspot.com), and this was my first blog. Since it had not been indexed for quite some time and I was unable to find the reason, I decided to delete the URL, i.e. techsupportforpc.blogspot.com. I created another URL, techsupportforum.blogspot.com, and moved all the posts from my earlier blog to this new blog. Now, after about one month, my current blog is showing robots.txt blocking errors for about 27 URLs. Since Blogger does not allow editing robots.txt, I am unable to find a solution. My blog is not showing up in Google. Can anybody help me?
@mark: I had the same experience as you when my blogs (fasthing.blogspot.com and anyutilities.blogspot.com) were very young. But you don't have to worry about it; without my doing anything, the robots.txt files for both blogs are now working well.
So, we still can't modify the robots.txt in Blogger, can we?
Can someone tell me how to save the robots.txt file in the top directory of my blog at http://mishsoftwares.blogspot.com
I would appreciate it if someone could explain it to me step by step.
I have already generated the robots.txt, but I do not know how to save it in the top directory of my blog.
Right now the pages are restricted by robots.txt
Yes. Robots.txt disallows all your posts from being read by search engines unless you have one post per blog.
So you can create three blogs for your three posts, with each blog named like its post title. In that case, your blogs will be listed in Google search. But I don't know if it works. If you want, you can use WordPress or Yahoo GeoCities instead. They don't disallow your blogs from being found in search engines.
I will transfer to Yahoo GeoCities now, because our posts do not mean a thing to search engines. Whatever you do, as long as robots.txt disallows them, your posts won't be read.
Thank you
I also have the "URLs restricted by robots.txt" issue: 59 URLs, due to labels, found through Sitemaps.
The real problem is that since July 31st my posts have not been indexed on Google. I've never had that problem.
I recently changed my Domain name through blogger from Obsession Collection to Obsession Collection Music.
help!!
http://www.obsessioncollectionmusic.com/
By default, Blogger restricts access to labels, as you may also have seen in Google Webmaster Tools.
But I think this affects Google's indexing, doesn't it? Especially when you are using a Google blog?
Susan....
Any word on the robots.txt issue?
My blog is http://theimreporter.blogspot.com.
Everything seemed to be fine. I had a PR2, and then all of a sudden my PR disappeared and 26 URLs are restricted by robots.txt.
Any assistance would be great
Hi Sean,
Are the blocked URLs under the /search/label/ directory? If so, that's expected for Blogger blogs and is nothing to worry about (see my comment to Ghosty above).
Regarding your PR, the short answer is not to worry about it; but if you want more detailed feedback you should post your URL in our Webmaster Help Group.
This is really strange. Google has indexed the wrong website. When I check the index stats for http://www.squidoo.com/parable it says my page is indexed. I click on the indexed page and it takes me to www.squidoo.com/parable-of-the-sower
My site was indexed in google but since this happened, it has vanished.
No clue how to fix this one, any ideas?
I manage a site for a magazine. It has a subscriber only section accessed by logging in.
I want the content that is hidden from normal users to be available for robots to index. How do I manage this?
The site is run using a MySql database.
Hi,
I wanted to know if it is possible to block subdomains such as -
abc.xyz.com
We have some subdomains that are exact copies of the main site - http://www.xyz.com
Instead of a 301 redirect, is it okay to use a robots.txt to disallow the indexing of such subdomains?
Because of these we have a lot of duplicate content issues.
Thanks,
Ranjana
@Ranjana:
Doing a 301 redirect would be a much better solution in this case. However, if you're unable to do that, you can block subdomains from crawling by placing a robots.txt file on each subdomain root, e.g. abc.example.com/robots.txt
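Because robots.txt applies per host, each subdomain is governed by its own file at its own root. The Python sketch below illustrates this by deriving the robots.txt location from any page URL. The helper name `robots_txt_url` and the example paths are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main = robots_txt_url("http://www.example.com/products/index.html")
sub = robots_txt_url("http://abc.example.com/products/index.html")
```

Here `main` is `http://www.example.com/robots.txt` and `sub` is `http://abc.example.com/robots.txt`, so a rule placed in the main site's file says nothing about the subdomain, and vice versa.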
Please tell me how to edit my robots.txt file in Google.
http://healthylife9i.blogspot.com/
@information-9:
You can't edit the robots.txt file of a blog hosted on blogspot.com.
What edits did you want to make?
I have a Blogspot blog, and not all of its posts are indexed by Google. When I check Webmaster Tools, it says I have 67 URLs restricted by robots.txt. I don't know how to fix this problem. Can you show me how to fix it? Thanks.
@ Jennifer Mathews Somo...some woman
Googlebot, AFAIK was created by men - just like the Internet and the WWW it crawls. So, if it has to have a gender, it must be a 'he'.
Simple as that.
Natasa
Make sure robots.txt includes
content="index,follow"
For pages you do not want indexed:
content="NOINDEX,NOFOLLOW"
I had put up a robots.txt file to disallow any web-crawling for a month while we got our website in shape. Now I've removed it, but it's been several days, and Google Webmaster Tool says it's still blocked from crawling. How long does it take for Google to realize that it's gone? I just replaced it with a blank robots.txt file, and I'm hoping that might help.
Thanks,
PG
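As far as the rules themselves go, a blank robots.txt places no restrictions on any robot. The Python sketch below compares a block-everything file with a blank one using the standard library parser. The helper `can_crawl` and the host are assumptions for illustration, and the sketch says nothing about how soon a crawler re-fetches the file after a change.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


block_all = "User-agent: *\nDisallow: /\n"  # the temporary "keep out" file
blank = ""                                  # the blank replacement

before = can_crawl(block_all, "http://www.example.com/")
after = can_crawl(blank, "http://www.example.com/")
```

`before` is `False` and `after` is `True`, so the blank file itself is not what keeps a site blocked.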
What am I doing wrong? - I generated a simple two line robots.txt file, tested it and got a flood of syntax errors?
This is the file:
http://www.spainforyou.net/
Disallow: /album/*.*
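The thread does not include an answer to this question, but two problems are visible in the posted file: the first line is a bare URL rather than a directive, and the `Disallow` rule is not preceded by a `User-agent` line. The Python sketch below is a deliberately simplified checker that flags both; `lint_robots_txt` and `KNOWN_FIELDS` are hypothetical names, and the "fixed" file is only one possible correction, assuming the intent was to block everything under /album/.

```python
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) problems found in `text`."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep or field not in KNOWN_FIELDS:
            problems.append((number, f"unrecognized line: {raw.strip()}"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow") and not seen_user_agent:
            problems.append(
                (number, f"{field} rule appears before any User-agent line")
            )
    return problems


# The two lines as posted in the comment above.
posted = "http://www.spainforyou.net/\nDisallow: /album/*.*\n"
# One possible correction (assumed intent: block the whole /album/ directory).
fixed = "User-agent: *\nDisallow: /album/\n"

posted_problems = lint_robots_txt(posted)
fixed_problems = lint_robots_txt(fixed)
```

The posted file yields one problem on each of its two lines, while the assumed correction yields none. A real validator would check much more, such as how blank lines separate groups, so treat this as a rough diagnostic only.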
HELP! I want to get the meta tag for my blog so I can paste it into Edit HTML in the blog's layout and submit a sitemap. Here is what is going on: in the Webmaster Tools dashboard, my blog is listed with a green check mark saying that it is already verified, so I have nothing to click to make the tag come up. I have not pasted a meta tag into my blog. I have added a sitemap by putting the Atom feed URL in the add-sitemap page, and it says OK, but I am afraid that without the meta tag I will have to resubmit it every time I make a change to the blog. How do I get the meta tag if all I have is an unclickable green check mark? Do I even need the meta tag? Will I have to redo my sitemap every time I make a change to my blog? Why is this so hard?
@Susan Moskwa,
If robots.txt is restricting my search/labels pages because they are duplicate content, then how do I stop seeing this error in the Google Webmaster Tools dashboard?
My paid survey sites Blogspot blog is not showing up in the SERPs due to duplicate content that exists in cached pages on the web. It has long since been removed, but the cached pages still exist, causing Google to penalize me, or so I think. I'll just wait until it's removed from the index.
@Darius: Duplicate content from /search/labels should not result in a penalty. Read this article and please post any follow-up questions in our Help Forum.
How do I change the robots.txt?
I am frequently getting the "restricted by robots.txt" error.
Hello guys, good day.
I am new to Blogger and I find it hard to verify my site. Can you tell me exactly what to put in my blog's homepage for my blog to be verified?
Or give an example of a blog with a meta tag on its homepage?
Thank you
I want to allow Mediapartners-Google in the robots.txt for my blog on Blogspot. Is that possible?
Google Webmaster Tools is really friendly, but for a Blogger blog we can't change the robots.txt file, because there is no option to change it unless the blog is on a special web server.
Does anyone know how to change the Blogger robots.txt file?
http://twitter.com/alamest
Hi everyone,
Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.
Thanks and take care,
The Webmaster Central Team