How to Build a robots.txt File

First of all, it’s not a necessity for every website to have a robots.txt file. This file exists to tell the search engines that you want them to limit their access to specific pages or directories of a website; without one, crawlers are free to fetch (and potentially index) the entire site. The file can address either all bots or just specific ones.

Depending upon the site, of course, allowing the search engines to index any and all pages can present a problem. For instance, if an ecommerce site with an on-site search function doesn’t limit access at all, then every single search results in a new URL being created, and those URLs can appear in Google’s search results (as well as any other search engine’s SERPs). This can easily result in a site with only 400-ish pages having tens of thousands of pages in the index. That is not necessarily a “good thing”.

Most sites, therefore, will want to exclude some pages from crawlers. While there are other methods of accomplishing this, the most common is via directives (really, they’re just requests) in the robots.txt file. The syntax used in the file is very specific, dealing with disallowing access at the directory or page level. Below, we’ll show you how to create a robots.txt file, where to put it on your server and how to herd the bots through your site.

Where to put your robots.txt file

The location of the robots.txt file is the simplest aspect… it must be in the root of your domain. If it’s located anywhere other than http://your-domain.com/robots.txt, it won’t accomplish anything, as the crawlers won’t see it as a robots directive file. Note that each host needs its own file: a robots.txt on your main domain does nothing for a subdomain.
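Given any page URL, the only place a compliant crawler will look is the root of that page’s scheme and host. A quick sketch in Python shows the idea (the robots_url helper name is ours, purely for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the one location a crawler will check for robots.txt:
    the root of the scheme + host the page lives on."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://your-domain.com/blog/post?id=1"))
# -> http://your-domain.com/robots.txt
```

Whatever path or query string the page has, the crawler throws it away and checks the root.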

How to create the file

The easiest way is to use a text editor such as Notepad to create a file named robots.txt. Then you’ll simply identify the user-agents you want to direct. For some areas of your site, this may mean all user-agents, in which case you’ll use a wild-card character (*).

User-agent: *

This will apply to all user-agents. If you want to address a specific bot, such as Googlebot-Image, you simply substitute that bot’s name for the wild-card.

User-agent: Googlebot-Image

Next, you’ll list the directories or URLs you want to exclude the bot from crawling, such as this:

User-agent: Googlebot
Disallow: /wp-admin/

This tells the user-agent Googlebot not to crawl the wp-admin folder or its contents. Of course, it’s possible that you might want to allow access to only one page of a directory, in which case you can disallow access to the directory, then follow it with an allow directive for a specific file within that directory, such as:

User-agent: Googlebot
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
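You can sanity-check a rule set like this with Python’s standard-library urllib.robotparser before deploying it. One caveat to be aware of: Python’s parser applies rules in file order (first match wins), while Google’s parser picks the most specific (longest) matching rule, so for this offline check we list the Allow line first.

```python
import urllib.robotparser

# Python's robotparser is first-match-wins, so the more specific Allow
# line comes first here. Google's own parser uses longest-match instead,
# so the order shown in the article works fine for Googlebot itself.
rules = """\
User-agent: Googlebot
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/wp-admin/"))                # False (blocked)
print(rp.can_fetch("Googlebot", "/wp-admin/admin-ajax.php"))  # True (allowed)
```

A quick check like this can save you from accidentally locking a crawler out of (or into) the wrong directory.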

This will direct Googlebot to stay out of the wp-admin directory, with the exception of the admin-ajax.php file. When you have multiple user-agents that you want to direct, you’ll need to add a set of rules for each, like this:

User-agent: Googlebot-Mobile
Disallow: */limit-*/*
User-agent: AdsBot-Google
Disallow: */limit-*/*
User-agent: Slurp
Disallow: */limit-*/*
User-agent: Bingbot
Disallow: */limit-*/*
User-agent: MSNBot
Disallow: */limit-*/*
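A pattern like */limit-*/* relies on wildcard support, which the major engines honor but which isn’t part of the original robots.txt standard. Roughly speaking, a wildcard-aware crawler matches paths the way Python’s fnmatch does (a simplification, and the example paths are made up; real parsers also support things like the $ end-anchor):

```python
from fnmatch import fnmatchcase

# Hypothetical paths, checked against the article's wildcard pattern.
# Each * matches any run of characters, so the pattern catches any URL
# with a path segment starting with "limit-" somewhere in the middle.
pattern = "*/limit-*/*"

print(fnmatchcase("/category/limit-20/page-2", pattern))  # True  (blocked)
print(fnmatchcase("/category/page-2", pattern))           # False (crawlable)
```

If you use wildcards, it’s worth testing a few real URLs from your site against the pattern before relying on it.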

If your site is large and updates often, such as a news site, you can also add crawl delays to limit how often an individual bot can recrawl. These delays, too, are user-agent specific and will need to be added for each. Be aware that not every crawler honors the Crawl-delay directive; Googlebot, for instance, ignores it.

User-agent: Googlebot-Mobile
Disallow: */limit-*/*
User-agent: AdsBot-Google
Disallow: */limit-*/*
User-agent: Slurp
Crawl-delay: 300
Disallow: */limit-*/*
User-agent: Bingbot
Crawl-delay: 120
Disallow: */limit-*/*
User-agent: MSNBot
Crawl-delay: 120
Disallow: */limit-*/*
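Crawl-delay values can also be read back with urllib.robotparser, which is a handy way to confirm the right number is attached to the right bot (a minimal check using a trimmed-down version of the rules above):

```python
import urllib.robotparser

rules = """\
User-agent: Slurp
Crawl-delay: 300
Disallow: /private/

User-agent: Bingbot
Crawl-delay: 120
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Each delay is tied to its own user-agent entry.
print(rp.crawl_delay("Slurp"))    # 300
print(rp.crawl_delay("Bingbot"))  # 120
```

A bot with no matching entry (and no * fallback) gets no delay at all, which is why each user-agent needs its own line.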

Finally, the robots.txt file is where you’ll want to point the search engines to your sitemap.xml file.

User-agent: Googlebot-Mobile
Disallow: */limit-*/*
User-agent: AdsBot-Google
Disallow: */limit-*/*
User-agent: Slurp
Crawl-delay: 300
Disallow: */limit-*/*
User-agent: Bingbot
Crawl-delay: 120
Disallow: */limit-*/*
User-agent: MSNBot
Crawl-delay: 120
Disallow: */limit-*/*
Sitemap: https://www.your-domain.com/sitemap.xml

(don’t forget to substitute your site’s sitemap URL)
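The Sitemap line is also picked up by urllib.robotparser (Python 3.8 and later), so you can verify the URL made it into the file correctly:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.your-domain.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns every Sitemap URL declared in the file.
print(rp.site_maps())  # ['https://www.your-domain.com/sitemap.xml']
```

Unlike Disallow rules, Sitemap lines aren’t tied to any user-agent; one declaration serves every bot that reads the file.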

Those are just the basics of a robots.txt file. Bear in mind, if you don’t know what you’re doing, you can block things you don’t want blocked, or leave things open to crawling and indexing that you’d prefer didn’t show up in the SERPs. So until you’re comfortable with the syntax, it’s a good idea to double-check your work. You can also test your robots.txt with Google’s testing tool, at https://support.google.com/webmasters/answer/6062598.

Remember, disallowing bots in your robots.txt file will only keep out the bots that are programmed to obey, so this isn’t a method of providing any sort of security for sensitive pages. And even with obedient bots, a disallowed page that’s linked from another page can still be discovered, and its URL can still end up in the index, even though the page itself isn’t crawled.

Doc Sheldon has worked in marketing since the 1980s and he's been writing professionally since the 1970s. He owned and published weekly and monthly newspapers and magazines during the ’80s, before becoming a business consultant and ultimately "retiring" in 2008. He began studying SEO in earnest in 2003, and now specializes in technical SEO. His passions are the development of the Semantic Web, trying to figure out what changes may be coming next from the search engines and eliminating misinformation.