22 Mar How To Verify You Have the Proper Robots.txt File
If you have a website or are involved in online marketing, you’ve likely heard about the robots.txt file. When implemented incorrectly, this file can have very negative and unintended consequences, such as blocked pages and resources. Imagine trying to rank for a keyword on a page that Google can’t access. Read this article to find out what a proper robots file looks like and to verify whether your website has one.
What Is It?
The robots file is located at http://www.website.com/robots.txt, at the root of your domain. It lets search engine crawlers know what parts of your website you do not want them to crawl. Crawlers typically request it before anything else on your website.
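As a rough illustration of where the file lives, the sketch below works out the robots.txt address from any page URL. The helper name robots_url and the blog page path are made up for this example; it uses only the Python standard library.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the root of the host, whatever page you start from,
    # so the path is replaced and the query string is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.website.com/blog/some-post?ref=home"))
# prints http://www.website.com/robots.txt
```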
What Does It Look Like?
A robots.txt file contains three main parts you should understand: the user-agent, the disallow rules, and the sitemap location.
The user-agent command dictates which crawlers the rules beneath it apply to. Websites most commonly use * for the user-agent because it signifies “all user agents.”
The three main search engine user agents are:
- “Googlebot” for Google
- “Bingbot” for Bing
- “Slurp” for Yahoo
If you wanted to give a command to a specific crawler, you would put that crawler’s name on the user-agent line in place of the *. Each crawler you name needs its own set of disallow commands.
For example, you would list Googlebot as the user-agent and then list the pages that Googlebot should not crawl.
Most reputable crawlers like Google, Yahoo, and Bing will follow the directives in the robots.txt file. Spam crawlers (which usually show up as traffic to your website) are less likely to follow the commands. Most of the time, using the * and giving the same command to all crawlers is the best route.
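To make the user-agent idea concrete, here is a minimal sketch that uses Python’s built-in urllib.robotparser module to read a sample file with one group for Googlebot and one for everyone else. The file contents, the /checkout/ path, and the can_crawl helper are assumptions made up for the example, and Python’s parser only approximates how each search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for all other crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /checkout/

User-agent: *
Disallow: /wp-admin/
"""

def can_crawl(robots_txt, user_agent, url):
    """Return True if user_agent may crawl url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

site = "http://www.website.com"
for agent in ("Googlebot", "Bingbot"):
    for path in ("/checkout/", "/wp-admin/"):
        print(agent, path, can_crawl(ROBOTS_TXT, agent, site + path))
```

In this sketch Googlebot is kept out of /checkout/ but not /wp-admin/, because a crawler that has its own group follows only that group. That is why each named crawler needs its own full set of disallow commands.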
The disallow command lets crawlers know which files or pages on your website you do not want them to crawl. Typically, disallowed files are customer-sensitive pages (like checkout pages) or back-office pages with sensitive information.
Most problems in the robots.txt file occur within the disallow section, usually because the file blocks too much. A rule such as Disallow: /wp-admin/ is an appropriate use: any URL whose path begins with /wp-admin/ will not be crawled.
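A quick way to see what a /wp-admin/ rule does and does not cover is to test a few URLs against it. The sketch below does that with Python’s urllib.robotparser; the URLs are invented for illustration, and the result reflects Python’s simple prefix matching rather than any particular search engine.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

# Hypothetical URLs to test against the rule.
urls = [
    "http://www.website.com/wp-admin/",
    "http://www.website.com/wp-admin/options.php",
    "http://www.website.com/wp-administrator-tips/",
    "http://www.website.com/blog/",
]

# True means the URL may be crawled, False means the rule blocks it.
results = {url: parser.can_fetch("*", url) for url in urls}
for url, allowed in results.items():
    print("crawl" if allowed else "blocked", url)
```

Only the first two URLs are blocked. The path /wp-administrator-tips/ does not begin with /wp-admin/ (note the trailing slash in the rule), so it stays crawlable.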
What does a rule of Disallow: / tell the crawlers? It tells them not to crawl any of the pages on your website. If you want your website visible in the search engines, then including a single / in the disallow section is detrimental to your search visibility.
If you want to err on the safe side, let all crawlers crawl every page on your website. You can do this by not disallowing anything.
To do so, leave the value after Disallow: empty. When a search crawler sees this, it will go ahead and crawl all pages it finds on your website.
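The difference between a single / and an empty disallow is easy to miss, so here is a small sketch comparing the two with Python’s urllib.robotparser. The allowed helper and the page URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, url, user_agent="Googlebot"):
    """Return True if user_agent may crawl url under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)

block_everything = ["User-agent: *", "Disallow: /"]
allow_everything = ["User-agent: *", "Disallow:"]

page = "http://www.website.com/products/"  # hypothetical page
print(allowed(block_everything, page))  # prints False
print(allowed(allow_everything, page))  # prints True
```

One character separates a fully blocked website from a fully crawlable one, which is why the disallow section deserves a second look whenever the file changes.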
A robots.txt file can also include the location of your website’s sitemap, on a line that begins with Sitemap: followed by the full URL, which I would highly recommend adding. The sitemap, if you have one, is often one of the next places a crawler will visit after your robots.txt file. Make sure the sitemap lists your webpages, specifically the ones you are trying to market.
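If you want to confirm that your sitemap line is being picked up, the sketch below reads it back with Python’s urllib.robotparser. It assumes Python 3.8 or later (for the site_maps method) and a sitemap named sitemap.xml, which is only a common convention and not something your website necessarily uses.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
    # Assumed sitemap location; use the real URL of your own sitemap.
    "Sitemap: http://www.website.com/sitemap.xml",
])

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)  # prints ['http://www.website.com/sitemap.xml']
```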
How Do I Verify that I Set Up My Robots.txt File Correctly?
Within Search Console, you’ll find a tool to test your robots.txt file.
If there are any problems or errors with your robots.txt file, Search Console will let you know. You can even input specific URLs to make sure Google can crawl them.
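For a quick check outside Search Console, you could run a list of your important URLs against the file yourself. The sketch below is one way to do that with Python’s urllib.robotparser; the blocked_urls helper, the sample rules, and the URLs are all assumptions for illustration, and the result is not a substitute for Google’s own report.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text blocks for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical file contents and pages you care about.
robots_txt = """\
User-agent: *
Disallow: /checkout/
Disallow: /wp-admin/
"""
important_pages = [
    "http://www.website.com/",
    "http://www.website.com/products/blue-widget",
    "http://www.website.com/checkout/cart",
]

print(blocked_urls(robots_txt, important_pages))
# prints ['http://www.website.com/checkout/cart']
```

To test a live website instead of pasted text, RobotFileParser can also download the file itself through its set_url and read methods.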
Remember that a search engine’s role is to a) crawl, b) index, and c) provide results.
The robots.txt file blocks search engines from crawling pages and sections, but it does not necessarily keep them out of the index. For example, if another page links to a blocked page, Google could follow that link and index the URL without crawling its content. Any time that Google indexes a page, it could show up in a search result.
If you don’t want a webpage to show up in a search result, say so on the page itself. Include the code <meta name="robots" content="noindex"> in the <head> of the specific page you don’t want search engines to index. Crawlers can only read this tag if the robots.txt file does not block the page.
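To double-check that a page really carries the noindex tag, you could scan its HTML for it. The sketch below uses Python’s built-in html.parser; the RobotsMetaFinder class, the has_noindex helper, and the sample HTML are made up for the example, and it only looks at the meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # A single tag can hold several comma-separated directives.
            self.directives += [d.strip().lower() for d in content.split(",")]

def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives

sample = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(sample))  # prints True
```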
The robots.txt file is certainly a more technical aspect of SEO, and it can get confusing. While this file can be tricky, simply understanding how a robots.txt file works will help you verify that your website is as visible as possible.