People often ask, what is robots.txt? It is simply a text file that provides directives, or instructions, to search engine crawlers about which parts of your site may be crawled and which may not. By allowing search engines to crawl the content on your site, you are also allowing them to index it, and indexing is what makes your content visible in search results.
If you are interested in more options for crawling a page without indexing the content – perhaps you want the crawler to follow the links on a page to reach other content you do want indexed – then this meta robots article is a good resource.
The first thing a search engine crawler does when it lands on a website is look for a robots.txt file. The crawler reads the directives in this file before it proceeds through the rest of the site.
A robots.txt file is not a website requirement. If you have no specific crawl instructions for user-agents, you likely don't need one. When the file doesn't exist, crawlers assume all pages and content on the site can be crawled and indexed by search engines.
That covers what robots.txt is and what it does, but there are a few other things to understand before you move forward with creating or editing your own robots.txt file.
The file must be placed in the site's root directory for crawlers to find it.
The easiest way to determine if you have a properly installed robots.txt file on your website is to type https://your-domain.com/robots.txt in your browser's address bar.
If it exists, you will see content that resembles something like this:
User-agent: *
Disallow:
If no robots.txt exists, your browser will return a "page not found" error.
A single robots.txt file may contain one or many sets of directives. Multiple sets must be separated by a single blank line, and there can be no blank lines within a set.
A set begins with a user-agent and then is followed by one or more directives.
User-agent: Googlebot
Disallow: /*private-html/
Disallow: /cgi-bin/
Sitemap: https://my-domain.com/sitemap.xml
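In this example set, Googlebot is told not to crawl any URL that contains private-html/ or anything in the /cgi-bin/ directory, and it is pointed to the site's sitemap.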
Each directive applies only to the user-agent identified in the set. Here are the four primary directive options:
User-agent - names the crawler (or, with the * wildcard, all crawlers) the set applies to
Disallow - tells the crawler not to crawl the specified path
Allow - permits a path that a broader Disallow rule would otherwise block
Sitemap - tells crawlers where to find your XML sitemap
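As a brief sketch (the paths here are hypothetical), Allow can be paired with Disallow to open up a single file inside a directory that is otherwise blocked:
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html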
There may be times when more than one set of rules could apply to a user-agent. In that case, the crawler honors the most specific set of instructions, the one that names it directly, over all others.
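For example (the paths are hypothetical), Googlebot would follow only the second set below because it names Googlebot directly, while every other crawler would follow the first:
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /temp/
Disallow: /drafts/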
Common search engine user-agents include:
Googlebot (Google)
Googlebot-Image (Google Images)
Bingbot (Bing)
Slurp (Yahoo)
DuckDuckBot (DuckDuckGo)
Baiduspider (Baidu)
It is worth noting that there are many types of user-agents out there, but the only ones that matter for robots.txt are search engine crawlers. Remember, robots.txt is where we instruct search engine crawlers how to proceed with crawling and indexing our content.
Even though you give specific directives in robots.txt, it is still up to individual crawlers to interpret them. Technically, a user-agent can choose not to adhere to the directives at all, although reputable crawlers do follow them; it is generally malware bots and other bad actors that ignore them.
The asterisk (*) is a significant special character. It acts as a wildcard, indicating something all-encompassing.
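For example (the /reports/ path is hypothetical), the asterisk in the first line applies the set to every crawler, and the asterisk in the second line stands for any string of characters, so the rule blocks any URL under /reports/ that contains .xls:
User-agent: *
Disallow: /reports/*.xls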
You need a text editor to create or edit the file. Some popular choices are Notepad, TextPad, Brackets, and Atom; there are many to choose from, and many are free downloads.
In addition to crawl directives, robots.txt is a very effective way to tell search engines where your sitemap(s) are located.
For example:
User-agent: *
Sitemap: https://www.my-domain.com/sitemap.xml
In this example, the directive tells all search engine crawlers that the sitemap is located in the website's root directory, in a file called sitemap.xml.
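If your site has more than one sitemap, you can list each on its own Sitemap line (the file names below are hypothetical):
User-agent: *
Sitemap: https://www.my-domain.com/sitemap-pages.xml
Sitemap: https://www.my-domain.com/sitemap-posts.xml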
To allow search engine crawlers to crawl everything on your site:
User-agent: *
Disallow:
To disallow, or forbid, search engine crawlers from crawling everything on your site:
User-agent: *
Disallow: /
If you want to stop Google Image Search from crawling and indexing the photos on your site:
User-agent: Googlebot-Image
Disallow: /
If you want to block single web pages from being crawled and indexed:
User-agent: *
Disallow: /private_page.html
Disallow: /private/financial_docs.xls
If you want to disallow portions of a server from robots:
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
If you want to disallow specific file types from being crawled:
User-agent: *
Disallow: /*.pdf$
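The dollar sign ($) marks the end of the URL, so this rule matches any URL that ends in .pdf.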
If you want to disallow any filename that contains a particular character sequence:
Contains private within the filename:
User-agent: *
Disallow: /*private*
A filename that begins with private:
User-agent: *
Disallow: /private*
A filename that ends with private:
User-agent: *
Disallow: /*private$
For specific questions regarding search directives, please feel free to reach out to us via the comments section below.
To learn more about our web services, check out our Web Design page. If you are interested in our other digital services or would like a quote, we would love to discuss your project with you. Call us at 904-330-0904.