Understanding Google's Planned Expansion of Unsupported Robots.txt Rules
In an exciting development for those managing websites, Google may soon expand its documentation to cover commonly used but unsupported robots.txt rules. Using data gathered from the HTTP Archive, Google is analyzing the most commonly used unsupported directives to ensure its documentation aligns with real web usage.
The project, outlined by Google engineers Gary Illyes and Martin Splitt in a recent episode of Search Off the Record, originated from a community member's proposal to add specific tags to the unsupported list. The engineers saw an opportunity to examine broadly used unsupported rules, aiming to document around 10 to 15 of the most prevalent directives.
How It All Began: Data-Driven Decisions
The research team focused on robots.txt files, analyzing which rules are actually applied across millions of sites via HTTP Archive's monthly crawls. Previous explorations had run into a significant obstacle: most crawlers do not request robots.txt files by default. The team therefore created a custom parser to extract the rules, enriching the dataset and making it available for further queries on Google BigQuery.
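As a rough illustration of that kind of analysis (not the team's actual parser), the sketch below tallies how often each field name appears across a set of robots.txt bodies; the sample inputs are made up for the example.

```python
from collections import Counter

def count_fields(robots_txt_bodies):
    """Tally how often each robots.txt field name appears across many files."""
    counts = Counter()
    for body in robots_txt_bodies:
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
            if ":" not in line:
                continue  # skip blank lines and anything that isn't a field:value pair
            field = line.split(":", 1)[0].strip().lower()
            if field:
                counts[field] += 1
    return counts

# Two tiny, made-up robots.txt bodies for illustration
sample_files = [
    "User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n",
    "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n",
]
print(count_fields(sample_files).most_common())
```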
The resulting data showed a stark falloff in usage beyond the three primary elements recognized by Google: user-agent, allow, and disallow. This finding indicates a need for clearer guidance on how to correctly implement less common rules while avoiding broken or misleading directives that don't produce the intended results.
Why This Matters for SEO Practitioners
As the robots.txt file plays a crucial role in SEO by directing search engines on how to interact with a site, understanding these updates is vital. Currently, Google recognizes only four fields: user-agent, allow, disallow, and sitemap; everything else is ignored, which leaves many website owners in the dark about how their unsupported directives are treated.
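For reference, a minimal robots.txt that sticks to the documented fields looks like the following; the crawl-delay line is included only to show a commonly used directive that Googlebot ignores (though some other crawlers honor it).

```text
User-agent: *
Disallow: /private/
Allow: /private/public-page.html

Sitemap: https://www.example.com/sitemap.xml

# Commonly seen but not supported by Google; ignored by Googlebot
Crawl-delay: 10
```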
By potentially documenting the top unsupported rules, Google aims to reduce misunderstandings among SEOs and developers about how robots.txt files should be constructed. This is particularly important because many webmasters have been using unsupported fields to manage crawling behavior.
Addressing Typos: A Step Towards User-Friendliness
Another noteworthy element of this expansion is Google's commitment to reassess how it handles common misspellings of the disallow rule, such as "dishallow." Gary Illyes hinted at developing more typo tolerance in Google's parsing behavior, which could significantly aid those less acquainted with technical SEO rules.
Such leniency would mean a site with a typo in its robots.txt still stands a chance of having its crawling directives recognized, preventing crawling and indexing problems that arise from simple mistakes and could otherwise cost visibility in search results.
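Google has not said how this tolerance would be implemented, but a simple fuzzy-matching approach along the following lines shows how a parser could map a near-miss like "dishallow" back to a known directive; the field list and similarity threshold here are illustrative assumptions, not Google's actual behavior.

```python
import difflib

KNOWN_FIELDS = ["user-agent", "allow", "disallow", "sitemap"]

def normalize_field(raw_field, cutoff=0.8):
    """Map a possibly misspelled field name onto a known directive, or return None."""
    field = raw_field.strip().lower()
    if field in KNOWN_FIELDS:
        return field
    matches = difflib.get_close_matches(field, KNOWN_FIELDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_field("Dishallow"))   # -> "disallow"
print(normalize_field("User agent"))  # -> "user-agent"
print(normalize_field("noindex"))     # -> None (not close to any supported field)
```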
Looking Ahead: Prepare Your Robots.txt Files
For SEOs and developers, the upcoming changes highlight the importance of regularly auditing robots.txt files. Anyone managing such files should ensure every directive present functions correctly per Google's specifications, reducing the risk that rules are silently ignored because they rely on unsupported commands.
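As a starting point for such an audit, a short script like the sketch below (not an official tool) can fetch a live robots.txt and flag any fields outside the four Google currently documents; the URL is a placeholder for your own site.

```python
import urllib.request

SUPPORTED_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}

def audit_robots_txt(url):
    """Fetch a robots.txt file and report any fields Google does not document."""
    with urllib.request.urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
    for line_number, line in enumerate(body.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in SUPPORTED_FIELDS:
            print(f"Line {line_number}: '{field}' is not a field Google documents")

# Placeholder URL; point this at your own site's robots.txt
audit_robots_txt("https://www.example.com/robots.txt")
```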
As Google aims to make its documentation reflect authentic practices observed online, those updating their robots.txt files should check for any outdated or ineffective commands. Users can also harness the HTTP Archive data, available publicly via BigQuery, to better understand current standards and the typical missteps found in other sites' configurations.
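For those who want to explore that data, the sketch below uses the official BigQuery Python client to run a simple aggregate query; the project, dataset, table, and column names are placeholders, so consult HTTP Archive's documentation for the real schema (and note that queries over the full dataset can incur BigQuery costs).

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Table and column names below are placeholders; check HTTP Archive's
# published schema before running a query like this.
QUERY = """
SELECT directive, COUNT(*) AS occurrences
FROM `your-project.your_dataset.robots_txt_rules`
GROUP BY directive
ORDER BY occurrences DESC
LIMIT 20
"""

client = bigquery.Client()  # uses your default Google Cloud credentials
for row in client.query(QUERY).result():
    print(row.directive, row.occurrences)
```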
Conclusion: Taking Action for Better Visibility
In summary, as Google gears up for a potential overhaul of its unsupported robots.txt directives list, website managers are advised to stay proactive. Regularly auditing robots.txt files, reviewing documentation, and understanding common missteps can aid in maintaining a site's visibility and effectiveness in search engines. The forthcoming updates could substantially streamline how SEOs approach their strategies, making what was once unclear a lot clearer.