Even though SEO specialists put most of their effort into improving the visibility of pages for their corresponding keywords, in some cases it is necessary to hide certain pages from search engines.
Let’s find out a bit more about this topic.
What is a robots.txt file?
Robots.txt is a file that tells search engine robots which areas of a website they are not allowed to crawl. It lists the URLs that the webmaster doesn’t want Google or any other search engine to crawl, so that compliant bots do not visit the selected pages.
When a bot finds a website on the Internet, the first thing it does is check the robots.txt file in order to learn what it is allowed to explore and what it has to ignore during the crawl.
To give you a robots.txt example, this is its syntax:
# All bots - Old URLs
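As a rough illustration of this syntax, the sketch below embeds a small robots.txt in a Python string and reads it with the standard library's `urllib.robotparser`. Only the comment line comes from the listing above; the `/old/` directory, the `example.com` domain and the variable names are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every bot ("*") is kept out of an assumed /old/ directory.
ROBOTS_TXT = """\
# All bots - Old URLs
User-agent: *
Disallow: /old/
"""

syntax_parser = RobotFileParser()
syntax_parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder domain.
old_allowed = syntax_parser.can_fetch("Googlebot", "https://example.com/old/page.html")
new_allowed = syntax_parser.can_fetch("Googlebot", "https://example.com/blog/")
print(old_allowed, new_allowed)  # False True
```

Each group starts with a `User-agent` line and is followed by the rules for that bot. Lines starting with `#` are comments.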
What is robots.txt in SEO
These directives guide Google’s bots when they find a new page. They are useful because:
- They help optimize the crawl budget, as the spider will only visit what’s truly relevant and it’ll make better use of its time crawling a page. An example of a page you wouldn’t want Google to find is a “thank you page”.
- The robots.txt file can help new pages get discovered by pointing crawlers to your sitemap, although it cannot force indexation by itself.
- Robots.txt files control crawler access to certain areas of your site.
- They can keep entire sections of a website out of the crawl, and you can create a separate robots.txt file for each root domain. A good example is, you guessed it, the payment details page.
- You can also block internal search results pages from appearing on the SERPs.
- Robots.txt can hide files that aren’t supposed to be indexed, such as PDFs or certain images.
Where do you find robots.txt
Robots.txt files are public. You can simply type in a root domain and add /robots.txt to the end of the URL and you’ll see the file…if there is one!
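If you want to build that address from any page URL, a minimal sketch could look like this. The function name `robots_txt_url` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post?id=7"))
# https://www.example.com/robots.txt
```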
Warning: avoid listing private information in this file.
You can find and edit the file in the root directory of your hosting, using the hosting file manager or an FTP connection to the website.
How to edit robots.txt
You can do it yourself
- Create or edit the file with a plain text editor
- Name the file “robots.txt”, all in lowercase and without any variation.
It should look like this if you want to have the site crawled:
Notice that we left “Disallow” empty, which indicates that there’s nothing that is not allowed to be crawled.
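A minimal sketch of such an allow-everything file, checked with Python's `urllib.robotparser`, might look like this. The domain and path are placeholders.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means that nothing is blocked.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

allow_all_parser = RobotFileParser()
allow_all_parser.parse(ALLOW_ALL.splitlines())

everything_allowed = allow_all_parser.can_fetch("*", "https://example.com/any/page")
print(everything_allowed)  # True
```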
In case you want to block a page, then add this (using the “Thank you page” example):
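As a sketch, and assuming the thank you page lives at the hypothetical path `/thank-you/`, the rule and a quick check of its effect could look like this:

```python
from urllib.robotparser import RobotFileParser

# /thank-you/ is an assumed path; replace it with the real URL of your page.
BLOCK_THANK_YOU = """\
User-agent: *
Disallow: /thank-you/
"""

thank_you_parser = RobotFileParser()
thank_you_parser.parse(BLOCK_THANK_YOU.splitlines())

thank_you_allowed = thank_you_parser.can_fetch("*", "https://example.com/thank-you/")
pricing_allowed = thank_you_parser.can_fetch("*", "https://example.com/pricing/")
print(thank_you_allowed, pricing_allowed)  # False True
```

The rule matches by prefix, so anything under `/thank-you/` is blocked as well.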
- Use a separate robots.txt file for each subdomain.
- Place the file on the website’s top-level directory.
- You can test the robots.txt files using Google Webmaster Tools before uploading them to your root directory.
- Take note that FandangoSEO is the ultimate robots.txt checker. Use it to monitor them!
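Before uploading a new file, you can also run a quick local sanity check. The sketch below is one possible approach with Python's `urllib.robotparser`; the helper name `blocked_urls`, the draft rules and the URLs are assumptions for illustration. It is not a replacement for a full checker.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: List[str], agent: str = "*") -> List[str]:
    """Return the URLs from `urls` that `agent` is not allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Draft file that has not been uploaded yet (hypothetical content).
draft = """\
User-agent: *
Disallow: /thank-you/
"""

# Pages that must stay crawlable (placeholder URLs).
must_stay_crawlable = ["https://example.com/", "https://example.com/products/"]
print(blocked_urls(draft, must_stay_crawlable))  # [] means nothing important is blocked
```

Note that `urllib.robotparser` only does simple prefix matching, so rules that rely on wildcards may behave differently for real search engine bots.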
See, it isn’t so difficult to configure your robots.txt file and edit it anytime. Just keep in mind that all you really want from this action is to make the most of the bots’ visits. By blocking them from seeing irrelevant pages, you’ll ensure their time spent on the website is much more productive.
Finally, remember that the SEO best practice for robots.txt is to ensure that all the relevant content is indexable and ready to be crawled! You can see the percentage of indexable and non-indexable pages among the total pages of a site using FandangoSEO’s crawl, as well as the pages blocked by the robots.txt file.
Robots.txt use cases
The robots.txt file controls the access of the crawler to some areas of the website. This can sometimes be risky, especially if Googlebot is accidentally blocked from crawling the entire site, but there are situations where a robots.txt file can be handy.
Some of the cases in which it is advisable to use robots.txt are the following:
- When you want to maintain the privacy of some sections of a website, for example, because it’s a test page.
- To avoid duplicate content appearing on the Google results page, although meta robots tags are an even better option for this purpose.
- When you do not want internal search result pages to appear on a public result page.
- To specify the location of the sitemaps.
- To prevent search engines from indexing certain files on the website.
- To indicate a crawl delay to avoid server overload when crawlers load several content pieces at once.
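Several of these use cases can live in the same file. The sketch below combines an internal search block, a crawl delay and a sitemap location, then reads them back with `urllib.robotparser` (Python 3.8 or later is assumed for `site_maps()`). The `/search/` path, the 10-second delay and the sitemap URL are made-up values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block internal search results, ask bots to wait
# 10 seconds between requests, and point them to the sitemap.
USE_CASE_RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
"""

use_case_parser = RobotFileParser()
use_case_parser.parse(USE_CASE_RULES.splitlines())

search_allowed = use_case_parser.can_fetch("*", "https://example.com/search/?q=shoes")
delay = use_case_parser.crawl_delay("*")
sitemaps = use_case_parser.site_maps()
print(search_allowed, delay, sitemaps)
# False 10 ['https://example.com/sitemap.xml']
```

Keep in mind that not every crawler honors `Crawl-delay`. Googlebot, for instance, ignores it.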
If there are no areas on the site where you want to control user-agent access, you may not need a robots.txt file.
Robots.txt SEO Best Practices
Follow these tips to manage the robots.txt files properly:
Don’t block content you want to be crawled
Nor should you block sections of the website that should be crawled.
Keep in mind that the bots will not follow the links of the pages blocked by robots.txt
Unless the linked resources are also linked from other pages that search engines can access, they will not be crawled and may not be indexed.
Also, no link value can be passed from the blocked page to the link destination. If you have pages to which you want to give authority, you must use a blocking mechanism other than robots.txt.
Do not use robots.txt to avoid showing confidential data on the search engine results page
Other pages can link directly to the page containing confidential information (bypassing the robots.txt rules on your root domain), which is why it can still be indexed.
You should use a different method, such as password protection or the noindex meta tag, to prevent the page from appearing in Google search results.
Remember that some search engines have multiple user agents
Google, for example, uses Googlebot for organic search and Googlebot-Image for image search.
Most user agents from the same search engine follow the same rules, which is why you don’t have to specify guidelines for every search engine crawler, but doing so allows you to control how the site content will be crawled.
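To illustrate separate rules per user agent, here is a small sketch with one group for Googlebot-Image and one for Googlebot. The `/photos/` and `/drafts/` paths and the domain are assumptions. Python's `urllib.robotparser` picks the first group whose name matches, so the more specific agent is listed first here; search engines may resolve the match differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep the image crawler out of /photos/ and the
# regular crawler out of /drafts/.
AGENT_RULES = """\
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: Googlebot
Disallow: /drafts/
"""

agent_parser = RobotFileParser()
agent_parser.parse(AGENT_RULES.splitlines())

photo = "https://example.com/photos/cat.jpg"
draft_post = "https://example.com/drafts/post"

image_bot_photo = agent_parser.can_fetch("Googlebot-Image", photo)
web_bot_photo = agent_parser.can_fetch("Googlebot", photo)
web_bot_draft = agent_parser.can_fetch("Googlebot", draft_post)
print(image_bot_photo, web_bot_photo, web_bot_draft)  # False True False
```

Bots that are not named in any group, and have no `*` group to fall back on, are allowed everywhere.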
The search engine caches the content of the robots.txt but usually updates the cached data daily
If you change the file and want to update it faster, you can send the robots.txt URL to Google.
Robots.txt file limitations
Finally, let’s look at the aspects that limit the usefulness of the robots.txt file:
Pages will continue to appear in search results
Pages that are blocked by the robots.txt file but are linked from a crawlable page may still appear in search results.
Only contains directives
Google highly respects the robots.txt file, but it is still a directive and not a mandate.
Google enforces a limit of 500 kibibytes (KiB) for robots.txt files, and content beyond this maximum size is ignored. We don’t know if other search engines also set a limit for these files.
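A quick size check before uploading can catch an oversized file. This is a minimal sketch that assumes the 500 KiB limit stated in Google's documentation; the constant and function names are made up for the example.

```python
# Assumed limit: 500 KiB, as documented by Google for robots.txt files.
GOOGLE_LIMIT_BYTES = 500 * 1024


def within_google_limit(robots_txt: str) -> bool:
    """Return True if the file, encoded as UTF-8, fits within the limit."""
    return len(robots_txt.encode("utf-8")) <= GOOGLE_LIMIT_BYTES


small_file = "User-agent: *\nDisallow: /thank-you/\n"
huge_file = "Disallow: /x\n" * 50_000  # 650,000 bytes of repeated rules

print(within_google_limit(small_file), within_google_limit(huge_file))  # True False
```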
Robots.txt is cached for up to 24 hours
According to Google, the robots.txt file is usually cached for up to 24 hours, which is something to keep in mind when making changes to the file.
It is not entirely clear how other search engines handle the cached file, but it is best to avoid caching your robots.txt so that search engines do not take longer to detect changes.