Robots TXT

What it is, what it is for and how it works

What the robots.txt file is

Robots.txt is a text file placed in the root directory of a domain to instruct search engine robots how to crawl the pages of your website. In practice, robots.txt files tell web crawlers whether or not to crawl certain folders on a website. These crawling instructions are specified by “disallowing” or “allowing” access, either for all crawlers or for specific ones.

Why it is important

The robots.txt file controls the access of crawlers to certain areas of your site. It allows us to manage the crawl budget and to tell search engines which parts of our site we do not want them to explore. This way we avoid wasting search engine resources. The robots.txt file is very useful in these situations:

Keep crawlers out of entire sections of a website (e.g., your engineering team’s staging site).
Prevent internal search results pages from appearing in a public SERP.
Specify the location of sitemaps.
Prevent crawling of certain files on your website (images, PDFs, etc.).
Specify a crawl delay to prevent your servers from being overloaded when crawlers load multiple pieces of content at once.
Prevent duplicate content from appearing in the SERPs.

However, to deal with duplicate content it is better to use meta robots and canonical tags.

Robots txt example

				
User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.isocialweb.agency/post-sitemap.xml
Sitemap: https://www.isocialweb.agency/page-sitemap.xml
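
As a rough illustration, rules like the ones above can be checked with Python's standard-library urllib.robotparser. This is only a sketch: the user agent name ExampleBot and the two tested paths are assumptions made for the demonstration. The Allow line is left out because the standard-library parser applies rules in file order, which may differ from how a given search engine resolves Allow against Disallow.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the example above (the Allow line is omitted, see the note).
EXAMPLE_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /?s=

Sitemap: https://www.isocialweb.agency/post-sitemap.xml
Sitemap: https://www.isocialweb.agency/page-sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

SITE = "https://www.isocialweb.agency"
# "ExampleBot" is a hypothetical crawler name; the "*" group applies to it.
blog_ok = parser.can_fetch("ExampleBot", SITE + "/blog/")
admin_ok = parser.can_fetch("ExampleBot", SITE + "/wp-admin/options.php")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(blog_ok, admin_ok)
print(sitemaps)
```

Here the blog path is reported as crawlable and the admin path as blocked, and the two sitemap URLs are returned as a list.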

Where to find Robots txt file

Whenever a bot or web crawler visits a site, be it Googlebot, the Facebook web crawler (Facebot), or any other, it first looks for the robots.txt file.

And it always looks for it in the same place: the root directory of the domain.

That is:

www.ejemplo.com/robots.txt

If an agent or bot visits this address but does not find a robots file there, it will assume that the site does not have one and will proceed to crawl everything on the site.

Even if a robots.txt file exists in another location, no crawler will look for it there, so the site will be treated as if it had no robots file.

To ensure that the robots.txt file is found, always include it in your home directory or root domain.
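
To make the location rule concrete, here is a minimal Python sketch that derives the address a crawler would request for any page URL. The function name robots_txt_url and the sample page URL are illustrative assumptions, not part of any crawler's actual code.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host are kept; path, query and fragment are dropped.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_txt_url("https://www.ejemplo.com/blog/post?id=7"))
# prints https://www.ejemplo.com/robots.txt
```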

Search engines have two main jobs:

Crawl the web to discover content;
Index that content so it can be provided to searchers who are looking for information.

To crawl sites, search engines follow links to get from one site to another and ultimately crawl billions of links and websites. This crawling behavior is sometimes referred to as “spidering”.

After arriving at a website, but before crawling it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file before it continues crawling the site.

Since the robots.txt file contains information about how the search engine should crawl, the information found in it guides the crawler’s actions on this particular site.

If the robots.txt file does not contain any directives that restrict the user agent’s activity (or if the site does not have a robots.txt file), the crawler will proceed to crawl the rest of the site.
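
The decision described above can be sketched in a few lines of Python. This is a simplified model, not how any particular search engine is implemented: the function may_crawl, the ExampleBot agent name, and the sample rules are all assumptions, and a missing file is represented by None.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Decide whether user_agent may fetch url, given the robots.txt body."""
    if robots_txt is None:
        # No robots.txt was found: the crawler assumes nothing is restricted.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


RULES = "User-agent: *\nDisallow: /private/\n"  # hypothetical file contents

print(may_crawl(None, "ExampleBot", "https://www.ejemplo.com/private/a"))
print(may_crawl(RULES, "ExampleBot", "https://www.ejemplo.com/private/a"))
print(may_crawl(RULES, "ExampleBot", "https://www.ejemplo.com/public/"))
```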

Robots txt syntax: What should it contain?

Robots.txt syntax can be considered as the “language” of robots.txt files.

There are five common terms that you are likely to find in a robots file.

They are as follows:

1. User-agent:

It indicates the name of the specific web crawler you are giving crawl instructions to. Common user agents include Googlebot, Googlebot-Image, Bingbot, Slurp, Baiduspider and DuckDuckBot.

2. Disallow:

The command is used to tell a user agent not to crawl a certain URL. Only one “Disallow:” line is allowed for each URL.

3. Allow:

Supported by Googlebot and the other major crawlers. This command tells a crawler that it can access a page or subfolder even if its parent page or subfolder is disallowed.

4. Crawl-delay:

The number of seconds a crawler should wait between requests when loading and crawling page content. Note that Googlebot does not recognize this directive; Google has historically handled crawl rate through Google Search Console rather than robots.txt.

5. Sitemap:

This is used to indicate the location of any XML sitemap associated with this URL. Note that this command is only supported by Google, Ask, Bing and Yahoo.
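
To illustrate how Allow and Disallow can interact, the sketch below resolves conflicts by picking the longest matching path, with Allow winning ties. This follows the precedence described in the Robots Exclusion Protocol standard (RFC 9309); individual crawlers may behave differently, wildcards are not handled, and the function name is_allowed and the sample rules are assumptions.

```python
from typing import List, Tuple


def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """Apply (directive, path_prefix) rules using longest-match precedence."""
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are ignored
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rule group for one user agent.
RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/admin-ajax.php", RULES))  # True
print(is_allowed("/wp-admin/options.php", RULES))     # False
print(is_allowed("/blog/", RULES))                    # True
```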

How the robots.txt file is created

It’s easy to create a robots.txt file: you just need to know a few specific commands. You can create this file using your computer’s notepad or any other text editor you prefer.

You also need access to the root folder of your domain, since that is where the file you have created must be saved.
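
If you prefer to generate the file with a script instead of a text editor, a minimal Python sketch could look like the following. The function write_robots_txt and the rules passed to it are illustrative assumptions, and a temporary directory stands in for the real root folder of your domain.

```python
import tempfile
from pathlib import Path
from typing import List


def write_robots_txt(site_root: Path, lines: List[str]) -> Path:
    """Write the given directives to <site_root>/robots.txt and return its path."""
    target = site_root / "robots.txt"  # the name must be lowercase
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


# A temporary directory is used here in place of the real document root.
with tempfile.TemporaryDirectory() as tmp:
    path = write_robots_txt(Path(tmp), ["User-agent: *", "Disallow: /wp-admin/"])
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```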

Best practices in the use of Robots txt:

To ensure that the robots.txt file is found, always include it in your home directory or root domain.
The file name is case sensitive: the file must be named “robots.txt” (not Robots.txt, robots.TXT, or anything else).
Some user agents (robots) may choose to ignore the robots.txt file. This is especially common with less ethical crawlers, such as malware bots or email address scrapers.
The /robots.txt file is publicly available: just add /robots.txt to the end of any root domain to see the directives for that website. This means that anyone can see which pages you do or don’t want to be crawled, so don’t use it to hide private user information.
Each subdomain of a root domain uses separate robots.txt files. This means that both blog.example.com and example.com should have their own robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).
It is generally good practice to indicate the location of any sitemap associated with this domain at the bottom of the robots.txt file.
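
The location and naming rules in this list can be checked mechanically. The sketch below is a simple Python check under those rules; the function name is_valid_robots_location and the sample URLs are assumptions used only for illustration.

```python
from urllib.parse import urlsplit


def is_valid_robots_location(url: str) -> bool:
    """True only if the URL path is exactly /robots.txt (case-sensitive)."""
    return urlsplit(url).path == "/robots.txt"


for candidate in (
    "https://example.com/robots.txt",
    "https://blog.example.com/robots.txt",  # each subdomain has its own file
    "https://example.com/Robots.txt",       # wrong case
    "https://example.com/files/robots.txt", # not at the root
):
    print(candidate, is_valid_robots_location(candidate))
```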

In conclusion:

The robots.txt file is part of the Robots Exclusion Protocol (REP), a set of web rules that govern how robots crawl the web, access and index content, and deliver that content to users.

Robots.txt files are an aid for search engines, and keeping this file up to date will help them understand how to treat the different sections of your website.

This way we control the crawl budget.

Important: To ensure that your robots.txt file is found, always include it in your main directory or root domain. Also, keep in mind that the file name is case-sensitive and that the file can be ignored by malicious bots. So, never rely on robots.txt instructions to protect private parts of your website. In these cases, restrict access by using passwords or server permissions.

Robots.txt dictates the crawling behavior of the entire site or directory.
Meta robots and X-Robots-Tag directives dictate indexing behavior at the individual page (or page element) level.
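
To show what a page-level directive looks like in practice, here is a small Python sketch that reads the meta robots tag from an HTML page using the standard-library html.parser. The class and function names and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="...">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def meta_robots_directives(html: str) -> List[str]:
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


SAMPLE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(meta_robots_directives(SAMPLE))  # ['noindex', 'nofollow']
```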