The robots.txt file is a useful and powerful tool for instructing search engine crawlers how you want them to crawl your website.
It is not all-powerful (in Google’s own words, it “is not a mechanism for keeping a web page out of Google”), but it can help prevent your site or server from being overloaded by crawler requests.
If you have this crawl block in place on your site, you need to be sure that it is being used correctly.
This is especially important if you use dynamic URLs or other methods that create a theoretically infinite number of pages.
In this guide, we’ll look at some of the most common problems with robots.txt, their impact on your website and search visibility, and how to fix them if you think they’ve occurred.
But first, let’s take a quick look at robots.txt and its alternatives.
What is a robots.txt file?
The robots.txt file is a plain text file placed in the root directory of your website.
It must sit in the top-level directory of your site; if you put it in a subdirectory, search engines will simply ignore it.
Despite its great power, robots.txt is often a relatively simple document, and a basic robots.txt file can be created in a matter of seconds with an editor such as Notepad.
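For example, a minimal robots.txt that lets every crawler access the whole site is just two lines (the comment is optional):

```
# Allow all crawlers to access everything
User-agent: *
Disallow:
```

Leaving the Disallow value empty means “disallow nothing”; a lone slash (Disallow: /) would block the entire site instead.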
There are other ways to achieve some of the same goals that robots.txt is typically used for.
Individual pages can include a robots meta tag within the code of the page itself.
You can also use the X-Robots-Tag HTTP header to influence how (and if) content appears in search results.
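As an illustrative sketch, a server can attach this header to its responses so that a file stays out of search results; the Apache snippet below assumes mod_headers is enabled, and the file name is a placeholder:

```
<Files "example.pdf">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```

This is especially useful for non-HTML content such as PDFs, where a robots meta tag cannot be embedded.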
What can robots.txt do?
The robots.txt file can achieve a variety of results across a range of different content types:
Web pages can be prevented from being crawled.
They may still appear in search results, but without a text description. Non-HTML content on the page will not be crawled either.
Media files can be blocked from appearing in Google search results.
This includes photo, video, and audio files.
If the file is public, it will still “exist” online and can be viewed and linked to, but its content will not appear in Google searches.
Resource files such as unimportant external scripts can be blocked.
However, if Google crawls a page that requires such a resource to load, Googlebot will “see” a version of the page as if that resource did not exist, which can affect indexing.
You cannot use a robots.txt file to completely block a web page from appearing in Google search results.
To achieve this, you have to use an alternative method like adding a noindex meta tag to the page header.
How Serious Are Robots.txt Errors?
A robots.txt error can have unintended consequences, but it’s often not the end of the world.
The good news is that by repairing your robots.txt file, you can usually recover from any errors quickly and fully.
Google’s guidance for web developers says this on the subject of robots.txt errors:
“Web crawlers are generally very flexible and typically will not be swayed by minor mistakes in the robots.txt file. In general, the worst that can happen is that incorrect [or] unsupported directives will be ignored.
Bear in mind though that Google can’t read minds when interpreting a robots.txt file; we have to interpret the robots.txt file we fetched. That said, if you are aware of problems in your robots.txt file, they’re usually easy to fix.”
6 common robots.txt errors
- The robots.txt file does not exist in the root directory.
- Misuse of wildcards.
- Noindex in robots.txt file.
- Blocked scripts and style sheets.
- There is no sitemap URL.
- Access to development sites.
If your website is behaving strangely in search results, your robots.txt file is a good place to check for mistakes, syntax errors, and overreaching rules.
Let’s look at each of the above errors in more detail and see how to make sure you have a valid robots.txt file.
1. The robots.txt file is not in the root directory
Search bots can only detect the file if it is in the root folder.
That’s why there should be only a forward slash between your website’s domain (the .com or equivalent) and the file name “robots.txt” in your robots.txt URL.
If there is a subfolder in there, your robots.txt file is probably not visible to search bots, and your website behaves as if there were no robots.txt file at all.
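Using example.com as a placeholder domain, the difference looks like this:

```
https://example.com/robots.txt          <- correct: file at the domain root
https://example.com/media/robots.txt    <- ignored: file inside a subdirectory
```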
To fix this problem, move the robots.txt file to the root directory.
It should be noted that for this you will need to have root access to your server.
Some CMS will upload files to a “media” subdirectory (or something similar) by default, so you may need to work around this to get the robots.txt file in the right place.
2. Misuse of wildcards
The robots.txt file supports two wildcard characters:
- An asterisk (*), which represents any instance of a valid character, like a joker in a deck of cards.
- A dollar sign ($), which denotes the end of a URL, allowing you to apply rules only to the final part of a URL, such as a file type extension.
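As an illustration (the paths here are placeholders), the two wildcards work like this:

```
User-agent: *
# Block every URL whose path starts with /private/, whatever follows
Disallow: /private/*
# Block only URLs that end in .pdf
Disallow: /*.pdf$
```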
It makes sense to take a minimalist approach to wildcards, since a single one can apply restrictions to a much larger portion of your website than intended.
It’s also relatively easy to end up blocking bot access from your entire site with a poorly placed asterisk.
To fix a wildcard issue, locate the incorrect wildcard and move or remove it so that your robots.txt file behaves as intended.
3. Noindex in robots.txt
This is more common on websites that are more than a few years old.
Google stopped obeying noindex rules in robots.txt files on September 1, 2019.
If your robots.txt file was created before that date and contains noindex directives, those pages will likely end up indexed in Google’s search results.
The solution to this problem is to implement an alternative “noindex” method.
One option is the robots meta tag, which you can add to the header of any webpage you want to prevent Google from indexing.
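That approach looks like this, placed inside the <head> of the page you want excluded:

```
<meta name="robots" content="noindex">
```

Unlike robots.txt, this requires Google to be able to crawl the page in order to see the tag, so make sure the page itself is not disallowed.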
4. Blocked scripts and style sheets
However, remember that Googlebot needs access to CSS and JS files in order to properly “see” HTML and PHP pages.
If your pages are behaving strangely in Google’s results, or Google doesn’t seem to be seeing them correctly, check whether you are blocking the crawler’s access to required external files.
A simple solution to this is to remove the line from the robots.txt file that blocks access.
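For example, if a broad rule blocks a directory that also holds your stylesheets and scripts, you can either delete that rule or re-open just the needed paths (the directory names here are placeholders):

```
User-agent: *
Disallow: /assets/
# Re-allow the resources Googlebot needs to render pages
Allow: /assets/css/
Allow: /assets/js/
```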
5. No sitemap URL
This is more about SEO than anything else.
You can include your sitemap URL in your robots.txt file.
Since this is among the first places Googlebot looks when crawling your website, listing it there gives the crawler a head start on learning the structure and main pages of your site.
This isn’t strictly an error, since omitting your sitemap shouldn’t negatively affect the core functionality and visibility of your website in search results, but it’s still worth adding your sitemap URL to robots.txt if you want to give your SEO efforts a boost.
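Adding it takes a single line anywhere in the file (example.com is a placeholder):

```
Sitemap: https://example.com/sitemap.xml
```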
6. Access to development sites
Blocking crawlers from your live website is a no-go, but so is allowing them to crawl and index pages that are still under development.
It’s a best practice to add a disallow directive to the robots.txt file for a website under construction so that it won’t be seen by the general public until it’s finished.
Likewise, it is essential to remove the disallow instruction when you launch the completed website.
Forgetting to remove this line from your robots.txt file is one of the most common mistakes among web developers, and it can prevent your entire website from being crawled and indexed properly.
If your development site seems to be getting real-world traffic, or a recently launched website doesn’t perform well at all in search, look for the global user-agent disallow rule in your robots.txt file:
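The rule in question is the two-line directive that blocks every compliant crawler from the whole site:

```
User-agent: *
Disallow: /
```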
If you see this rule when you shouldn’t (or don’t see it when you should), make the necessary changes to your robots.txt file and check that your website’s search appearance updates accordingly.
How to recover from Robots.txt error
If an error in your robots.txt file is having undesirable effects on your website’s search appearance, the first and most important step is to correct your robots.txt file and verify that the new rules have the desired effect.
Some SEO crawling tools can help with this so that you don’t have to wait for search engines to next crawl your site.
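You can also sanity-check a rule set locally before (or after) deploying it. This sketch uses Python’s standard-library urllib.robotparser to test whether sample URLs would be blocked; the rules and URLs are purely illustrative. Note that Python’s parser applies the first matching rule, so the Allow exception is listed before the broader Disallow it overrides (Google instead uses longest-match precedence).

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules, parsed directly without fetching anything.
# Python's parser honors the first matching rule, so the Allow exception
# appears before the broader Disallow.
RULES = """\
User-agent: *
Allow: /private/public-report/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch(user_agent, url) -> True if the URL may be crawled
print(parser.can_fetch("*", "https://example.com/index.html"))              # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))       # False
print(parser.can_fetch("*", "https://example.com/private/public-report/"))  # True
```

Tools like this only approximate Googlebot’s behavior, but they catch gross mistakes (such as a site-wide Disallow) before a search engine ever sees them.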
Once you are confident that robots.txt is behaving as intended, you can try to get your site recrawled as soon as possible.
Submit an updated sitemap and request that any inappropriately deleted pages be recrawled.
Unfortunately, you’re at Googlebot’s mercy; there’s no guarantee as to how long it might take for any missing pages to reappear in Google’s search index.
All you can do is take the right steps to minimize that time and keep checking until Googlebot applies the fixed robots.txt file.
When it comes to robots.txt errors, prevention is definitely better than cure.
On a large revenue-generating website, a stray wildcard that removes your entire website from Google can have an immediate impact on earnings.
Modifications to robots.txt should be carefully made by experienced developers, double-checked, and where appropriate – subject to a second opinion.
If possible, test changes in a sandbox environment before pushing them live to your real-world server, to ensure you avoid inadvertently causing availability issues.
Remember, when the worst happens, it’s important not to panic.
Diagnose the problem, make the necessary repairs to the robots.txt file, and resubmit your sitemap for a new crawl.
With luck, you’ll regain your place in the search rankings within days.
- Is Google having trouble with large Robots.txt files?
- 7 SEO crawler warnings and errors you can safely ignore
- Advanced Technical SEO: A Complete Guide
Featured image: M-SUR / Shutterstock