Does Google Have A Problem With Big Robots.txt Files?
Google is addressing the topic of robots.txt files and whether it is good SEO practice to keep them to a reasonable size.
This topic was discussed by Google Search Advocate John Mueller during a Google Search Central SEO office-hours hangout recorded on January 14.
David Zieger, director of SEO for a major news publisher in Germany, joins the livestream with concerns about a “huge” and “complex” robots.txt file.
How big are we talking here?
Zieger says there are more than 1,500 lines with a “considerable number” of disallows that keeps growing over the years.
Disallowing prevents Google from indexing HTML fragments and URLs where AJAX calls are used.
Zieger says it’s not possible to set a noindex, which is another way to keep the fragments and URLs out of Google’s index, so he has resorted to filling the site’s robots.txt file with disallows.
Are there any negative SEO effects of a huge robots.txt file?
SEO Considerations for Large Robots.txt Files
A large robots.txt file will not directly cause any negative impact on a site’s SEO.
However, a large file is harder to maintain, which may lead to accidental problems down the road.
“There are no direct negative issues with SEO, but it makes it more difficult to maintain. It also makes it much easier to accidentally push something that causes problems.
“So just because the file is large doesn’t mean it’s a problem, but it makes it easier for you to create problems.”
Zieger goes on to ask if there are any issues with not including a sitemap in the robots.txt file.
Mueller says that’s not a problem:
“No. These different ways to submit a sitemap are all equivalent to us.”
Zieger then starts asking several more follow-up questions that we’ll look at in the next section.
Does Google recognize HTML fragments?
Zieger asks Mueller what the SEO impact would be of radically shortening the robots.txt file, for example by removing all the disallows.
The following questions are asked:
- Does Google recognize parts of HTML that are not relevant to site visitors?
- Will HTML fragments end up in the Google search index if they are not blocked in robots.txt?
- How does Google handle pages where AJAX calls are used (such as for header or footer elements)?
He summarizes his questions by stating that most of what is disallowed in his robots.txt file is header and footer elements that are not interesting to the user.
Mueller says it’s hard to know exactly what would happen if these fragments were suddenly allowed to be indexed.
A trial-and-error approach might be the best way to find out, Mueller explains:
“It’s hard to say what you mean with regard to those fragments.
My thought there would be to try to figure out how those fragment URLs are used. And if you’re not sure, maybe take one of those fragment URLs and allow it to be crawled, look at the content of that fragment URL, and then check to see what happens in search.
Does it affect anything about the indexed content on your site?
Could some of this content be found within your site all of a sudden?
Is this a problem or not?
And try to work based on that, because it’s very easy to block things by robots.txt that aren’t actually used for indexing, and then spend a lot of time maintaining that big robots.txt file, but it doesn’t actually change that much for your website.”
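One way to set up the experiment Mueller describes is to carve a single fragment URL out of an existing disallow with an `Allow` rule. The sketch below is illustrative only: `example.com` and the paths are hypothetical, and the rules are checked with Python’s standard-library `urllib.robotparser`, not with Google’s own parser. The `Allow` line is placed first because the Python parser applies the first matching rule, while Google applies the most specific one; with this ordering both should reach the same verdict.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one fragment URL is opened up for the experiment,
# everything else under /fragments/ stays blocked.
ROBOTS_TXT = """\
User-agent: *
Allow: /fragments/header.html
Disallow: /fragments/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

for url in (
    "https://example.com/fragments/header.html",
    "https://example.com/fragments/footer.html",
    "https://example.com/news/article.html",
):
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict}: {url}")
```

Once the single URL is crawlable, what it does to the indexed content still has to be checked in search itself; the script only confirms that the rules say what was intended.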
Other considerations for building a Robots.txt file
Zieger has one last follow-up question about robots.txt files, asking if there are any specific guidelines to follow when building one.
Mueller says there is no set format to follow:
“No, it’s basically up to you. Like some sites have large files, some sites have small files, all of which should just work.
We have open source code for the robots.txt parser that we use. So what you can also do is get the developers to run that parser for you, or set it up so that you can test it, and then check the URLs on your website with that parser to see which URLs will actually be blocked and what will change. That way you can test things out before you make them live.”
The robots.txt parser Mueller refers to can be found on GitHub.
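The kind of pre-launch check Mueller describes can be sketched as a small script that compares the current robots.txt with a proposed one and lists the URLs whose status would change. This is a rough illustration, not Google’s tooling: it uses Python’s standard-library `urllib.robotparser` as a stand-in, whose matching may differ from Google’s parser (for example in wildcard handling and rule precedence), and the rules, URLs, and the `changed_urls` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def changed_urls(current_txt, proposed_txt, urls, agent="Googlebot"):
    """Return (url, allowed_now, allowed_after) for each URL whose status changes."""
    current = load_rules(current_txt)
    proposed = load_rules(proposed_txt)
    changes = []
    for url in urls:
        before = current.can_fetch(agent, url)
        after = proposed.can_fetch(agent, url)
        if before != after:
            changes.append((url, before, after))
    return changes


# Hypothetical example: the proposed file drops the /fragments/ disallow.
CURRENT = """\
User-agent: *
Disallow: /fragments/
Disallow: /ajax/
"""

PROPOSED = """\
User-agent: *
Disallow: /ajax/
"""

SAMPLE_URLS = [
    "https://example.com/fragments/header.html",
    "https://example.com/ajax/footer",
    "https://example.com/news/article.html",
]

for url, before, after in changed_urls(CURRENT, PROPOSED, SAMPLE_URLS):
    print(f"{url}: allowed {before} -> {after}")
```

For results that match what Googlebot will actually do, the same list of URLs would need to be run through Google’s open source parser itself.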
Hear the full discussion in the video below:
Featured image: Screenshot from YouTube.com/GoogleSearchCentral, January 2022.