How To Verify You Have the Proper Robots.txt File

This article was last updated on 5/17/21

If you have a website or are involved in online marketing, you’ve likely heard about the robots.txt file. When implemented incorrectly, this file can have very negative and unintended consequences, such as blocked pages and resources. Imagine trying to rank for a keyword on a page that Google can’t access. In this article, we will cover details including why a robots.txt file is important, how to access a robots.txt file, and much more!

What is a Robots.txt File and Why is it Important?

A robots.txt file tells search engine crawlers which parts of your website you do and do not want them to crawl. It is typically the very first file a search engine requests when it visits your website.

Why is it important?

  • It can help keep duplicate content from showing up in the SERPs.
  • It gives you more control over what a search engine does and doesn’t crawl.
  • A sitemap can be added directly into the robots.txt file, helping search engines read and understand the blueprint of your website.
  • It helps keep certain pages, images, and files from being crawled so they don’t take up crawl budget.
  • If you have many pieces of content that load at once, setting a crawl delay in the file can help keep your server from being overloaded (see the sketch just after this list).
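
As a quick sketch (the ten-second value is only an illustration, and note that Google ignores Crawl-delay while engines such as Bing honor it), a crawl delay for all crawlers looks like this:

User-agent: *
Crawl-delay: 10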

How to access a robots.txt file

If you’re not sure whether your website has a robots.txt file, it’s easy to check! Access your robots.txt file by adding /robots.txt at the end of your domain. Here’s an example of ours at Boostability: https://www.boostability.com/robots.txt

If you don’t see anything when you try this out for your website or are taken to a 404 error page, you don’t have one. The next step would be to prioritize creating a robots.txt file for your website!

 

What Should a Robots.txt File Look Like?

At a minimum, a robots.txt file has three main parts you need to understand.

User-agent

[Image: example robots.txt file showing the user-agent line]

This directive tells crawlers which of them the rules that follow apply to. Websites most commonly use * for the user-agent because it signifies “all user agents.”

With the rise of new search engines entering the market and depending on your location, the list of search engine user-agents can get long. Below is a list of some of the main search engine user agents:

  • “Googlebot” for Google
  • “Bingbot” for Bing
  • “Yahoo! Slurp” for Yahoo
  • “Yandex” for Yandex
  • “Baiduspider” for Baidu
  • “DuckDuckBot” for DuckDuckGo
  • “Aolbuild” for AOL

If you wanted to give a command for a specific crawler, you would place the user-agent ID in the user-agent location. Each time you refer to a crawler, you would need a separate set of disallow commands.

For example, you list Googlebot as the user-agent and then notify the crawler what pages to disallow.
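
Here’s a quick sketch of what that could look like; the path is just a placeholder, not a recommendation:

User-agent: Googlebot
Disallow: /example-private-directory/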

Most reputable crawlers, like Google, Bing, and Yahoo, will follow the directives in the robots.txt file. Spam crawlers (which often show up as bot traffic to your website) are less likely to follow them. Most of the time, using the * and giving the same commands to all crawlers is the best route.

Disallow

[Image: example robots.txt file with a disallow rule and sitemap location]

This command lets crawlers know which files or pages on your website you do not want them to crawl. Typically, disallowed files are customer-sensitive pages (like checkout pages) or backend office pages with sensitive information.

Most robots.txt problems occur in the disallow section. Issues arise when you block too much information in the file. The example above shows an appropriate disallow rule: any URLs that begin with /wp-admin/ will not be crawled.
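
Recreated in plain text from the description above (using the catch-all user-agent), that rule reads:

User-agent: *
Disallow: /wp-admin/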

The example below is what you do not want to include in the disallow section.

[Image: robots.txt file blocking crawlers from the entire site]

What is the disallow command telling the crawlers in the picture above? In this situation, the crawlers are told not to crawl any page on your website. If you want your website visible in the search engines, then including a single / in the disallow section is detrimental to your search visibility. If you notice a sudden drop in traffic, check your robots.txt file first to see if this issue is present.
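
Written out, the blocking file described above is just the following, where the single slash matches every URL on the domain:

User-agent: *
Disallow: /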

Google even sends out Search Console messages letting websites know if the robots.txt file blocks resources it needs to crawl, like CSS and JavaScript files.

[Image: Google Search Console notification about blocked resources]

If you want to lean on the safe side, let all crawlers crawl every page on your website. You can do this by not disallowing anything.

[Image: robots.txt file with an empty disallow line]

As you can see, the disallow line is left blank after the colon. When a search crawler sees this, it will go ahead and crawl all the pages it finds on your website.
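
Written out from the description above, that wide-open file is simply:

User-agent: *
Disallow: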

Sitemap

A robots.txt file can also include the location of your website’s sitemap, which is highly recommended. The sitemap is the second place a crawler will visit after your robots.txt file. It helps search engines better understand the structure and hierarchy of your website. Make sure the sitemap lists your webpages, specifically the ones you are trying to market or your most valuable pages.

Note: If you have multiple sitemaps for your website, add them all into the robots.txt file – you can list out more than one.
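
For example, with example.com standing in for your own domain and the filenames as placeholders, the sitemap lines simply sit alongside the other directives:

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/post-sitemap.xml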

 

How to Check if your Robots.txt File is Working or Not

It’s good practice to check whether your website’s robots.txt file is working by using the robots.txt Tester in your Search Console account. Use this tool to test your robots.txt file or individual URLs whenever you’re in doubt.

Below is a screenshot of what it looks like. You just need to input your robots.txt file or the specific URL you want to test and it will tell you whether it’s accepted or blocked.

[Image: the robots.txt Tester in Google Search Console]

If there are any problems or errors with the robots.txt file for your website, Search Console will let you know. Remember that a search engine’s role is to a) crawl, b) index, and c) provide results.

Keep in mind that the robots.txt file controls crawling, not indexing. For example, if another page links to a URL that your robots.txt blocks, Google can still discover that link and index the blocked page without ever crawling it. Any time Google indexes a page, it could show up in a search result.

If you don’t want a webpage to show up in a search result, include that information on the page itself. Add the code <meta name="robots" content="noindex"> within the <head> tags of the specific page you don’t want search engines to index, and keep in mind that the page must stay crawlable for search engines to see the tag. Or you can add a list of pages to be noindexed directly in the robots.txt file; keep reading to learn how.
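
For context, here is roughly where that noindex meta tag sits within a page; the title below is just a placeholder:

<head>
  <title>Thank You</title>
  <meta name="robots" content="noindex">
</head>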

 

How to Create a Robots.txt File that is Simple and SEO-Friendly

If you don’t have a robots.txt file for your website, not to worry, because you can make one! Below, we will show you how to create an SEO-friendly robots.txt file in a few simple steps.

We recommend getting some assistance from a trusted web developer if you need it, because this does require working with your website’s files on the server.

1. Use a plain text editor

On Windows, use Notepad; on a Mac, use TextEdit. Avoid using Google Docs or Microsoft Word because they can insert formatting and characters that you don’t intend to have in the file.

2. Assign a user-agent

As we mentioned above, most sites typically allow all search engines to access their content. If you choose to do this, simply type in:

User-agent: *

[Image: robots.txt file with User-agent: * assigned]

If you want to specify rules for different user-agents, you will need to split the rules into separate user-agent groups. For example, let’s take a look at SEMrush’s robots.txt file below:

[Image: SEMrush’s robots.txt file with rules for multiple user-agents]

They list out specific rules for different user-agents: the pages SEMrush disallows for Google’s user-agent are different from the ones it doesn’t want Bing’s user-agent to crawl. If you find yourself in a similar situation, follow the structure above and give each user-agent its own group of rules in your plain text editor.
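
As a generic sketch of that structure (the crawler names are real user-agent tokens, but the paths are placeholders rather than SEMrush’s actual rules):

User-agent: Googlebot
Disallow: /example-reports/

User-agent: Bingbot
Disallow: /example-internal-search/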

3. Specify the disallow rules

To keep this scenario as simple as possible, we will not add anything to the Disallow line. (You can also choose to leave out the disallow section entirely and keep just the user-agent rule.) Either way, search engines will crawl everything on the website.

[Image: robots.txt file with an empty Disallow line]

To make your robots.txt file even more SEO-friendly, it’s good practice to add pages that visitors don’t typically need to find in search to the disallow section, because this frees up crawl budget. An example for a WordPress site would be:

[Image: robots.txt file disallowing WordPress admin pages and a Thank You page]

The picture above tells every user-agent not to crawl the WordPress admin (backend) pages or the Thank You page; keeping the Thank You page out of search results helps ensure only qualified leads are counted, not accidental visitors who find the page through a SERP. By filtering these kinds of pages out of the crawl budget, you can put more attention on the valuable pages you want search engines to crawl and people to visit.
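
Written out, a file along those lines would read as follows; /thank-you/ is a placeholder for whatever your thank-you page’s actual path is:

User-agent: *
Disallow: /wp-admin/
Disallow: /thank-you/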

4. Add your sitemap

Last but not least, don’t forget to add your sitemap(s) as you finish creating your robots.txt file. List it out at the bottom after the Disallow section.

[Image: robots.txt file with the sitemap listed after the Disallow section]

5. (Optional) Noindex pages within the robots.txt

This isn’t a necessary step, and it comes with a caveat. As mentioned earlier, the robots.txt file does not automatically noindex pages; it just tells search engines which ones not to crawl. On top of that, Google announced in 2019 that it no longer supports a noindex directive in robots.txt. If you have certain pages that you don’t want indexed (i.e. Thank You or Confirmation pages), the dependable route is to update the meta tag on the page directly, though you may still come across robots.txt files with a noindex section like the example below:

[Image: robots.txt file with a Noindex section]
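
For reference, and bearing the caveat above in mind, such a section was typically written with paths like these (both placeholders) underneath the Disallow rules:

Noindex: /thank-you/
Noindex: /confirmation/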

6. Submit it to the root directory

Once you’re finished creating your robots.txt file, the last step is to upload it into the root directory of your website. Once it’s uploaded, add /robots.txt to the end of your domain and confirm that the file loads in your browser. Then test it out using Google’s robots.txt Tester tool.

 

Make Your Robots.txt File SEO-Friendly

The robots.txt file is certainly a more technical aspect of SEO, and it can get confusing. While this file can be tricky, simply understanding how a robots.txt file works and how to create one will help you verify that your website is as visible as possible. It’s a powerful tool that can be used to take your SEO strategy even further.

But if you need help with your robots.txt file or any other part of the SEO campaign, we’re here to help! Boostability helps small businesses boost their SEO campaigns, increase online visibility, and stay relevant against other competitors. Let us help you take your SEO strategy to the next level!

Maja Teagle

Maja is the former SEO Manager for the marketing team at Boostability. After graduating from the University of Utah with a degree in Marketing, her focus has been on expanding her knowledge and skill set in SEO. Prior to joining the corporate marketing team at Boostability, Maja gained experience working at several digital marketing agencies in Salt Lake City, focusing on SEO strategy development and fulfillment, as well as client account management. Working closely with clients ranging from small businesses to enterprise organizations, she has managed and executed SEO strategies for over 20 different company websites. Outside of work, Maja loves to go on hikes with her husband and dog, play volleyball, bake and cook, and try new restaurants throughout the city (she considers herself a fry-connoisseur).