
Crawling vs. Indexing + How To Fix

Updated June 4, 2020

Articles about SEO and how to improve site performance frequently toss around terms like crawling, indexing, robots.txt, meta robots, canonical and more.

It’s absolutely essential to understand the difference between crawling and indexing, plus why it matters to your site’s performance in search rankings.

Matt C. at Google explains the basics: Here!

Crawling

Search engines visit a site, often by following links from other sites. They check the robots.txt file to see where they’re allowed to go, then explore the site by following one internal link after another. They’ll also use any sitemap provided as a guide to what there is to find.
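A sitemap can also be referenced directly in the robots.txt file, so crawlers find it without being told. For example (assuming the sitemap lives at /sitemap.xml):

Sitemap: https://www.site.com/sitemap.xml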

The robots.txt file is used to let search engines know which pages of the site they shouldn’t crawl, and to keep crawlers away from images, video and other media files, as well as any script and CSS files you don’t want fetched.
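For example, entries like these (the folder names here are hypothetical) tell all crawlers to stay out of a media folder and a scripts folder:

User-agent: *
Disallow: /media/
Disallow: /scripts/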

It does NOT prevent Google from indexing any particular page.

Read that again… robots.txt does not prevent Google from indexing any particular page.

Here’s Google’s official explanation of robots.txt.

Indexing

In the head of each page are a number of HTML tags specifying things like the language of the page, the character set, which CSS stylesheets to load, the page title and the meta description. The head can also include a meta robots tag like:
<meta name="robots" content="index, follow">

A page with the above tag will be a candidate for indexing and the links on the page will be checked by search engine crawlers.
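In context, the meta robots tag sits alongside the other head elements. A minimal sketch, with placeholder values:

<head>
<meta charset="utf-8">
<title>Example Page</title>
<link rel="stylesheet" href="style.css">
<meta name="robots" content="index, follow">
</head>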

The meta robots tag is covered by Google: Here!

Preventing Indexing

Keeping a page out of the search index means using a modified version of the above:
<meta name="robots" content="noindex, follow">

In this case, search engines are instructed not to index the page content but to still follow links on the page to discover other pages of importance.
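Other combinations are possible. A page that should be neither indexed nor used for link discovery would use:

<meta name="robots" content="noindex, nofollow">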

Blocking Crawling Means Indexing Directives Won’t Be Followed!

A search engine must be allowed to crawl a page in order to discover the meta robots instructions about indexing.

If a robots.txt file instructs search engines not to crawl a page you don’t want in the index, they’ll never see the meta robots tag and won’t know to avoid indexing it.

And if a page has already made it into the index (for example, because an external site linked to it), changing the meta robots tag to noindex won’t work while crawlers are blocked by robots.txt.

The only way to get a page out of the index is to allow crawling (remove any blocks in robots.txt) and then add the meta robots noindex tag.
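Putting that together for a hypothetical page at /old-page/: make sure robots.txt contains no rule blocking it, then add the noindex tag to the page itself.

# In robots.txt, remove (or never add) an entry like:
# Disallow: /old-page/

<!-- In the page's head: -->
<meta name="robots" content="noindex, follow">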

Once you’ve unblocked Google from crawling, by removing the relevant robots.txt entry, and added the meta robots noindex tag, you can request URL removal:

Google Search Console URL Removal tool

How to Edit the robots.txt File

Many site owners use an SEO plugin of one type or another. If so, there’s a good chance you’ll find a robots.txt editor within the plugin’s options.

Otherwise, if you’re using cPanel:
1. Log in to cPanel.
2. Open File Manager.
3. Go to the root directory of your site (often /public_html/).
4. Use either the Editor or the Code Editor to open robots.txt.
5. Save your changes after editing.

Block Design and Development Servers

Many agencies put new work online so that remote workers can collaborate and clients can view progress. It’s important that these instances block search engines so the staging copy doesn’t end up competing with the live site as duplicate content.

A blanket robots.txt entry can stop search crawlers from accessing any of the site:
User-agent: *
Disallow: /

This entry tells all bots (User-agent: *) to stay away from every page on the site (Disallow: /).

DO NOT USE THE ABOVE ON YOUR MAIN SITE!

Double-Checking robots.txt

There are two easy ways of checking the current robots.txt file without logging in to cPanel.

Visit the Site
www.site.com/robots.txt

This will display the robots.txt file as seen by search crawlers.

Google Search Console robots.txt Tester
Google will report on what it has found, along with additional information about how the file affects crawling.

There’s also a submit function where you can let Google know you’ve updated your robots.txt and that the crawler should fetch it again.

Best Practice for WordPress robots.txt

Advice varies, but a generally accepted approach is to allow all search engines to crawl all pages except the /wp-admin/ folder, while still allowing public access to admin-ajax.php. Many themes and plugins rely on admin-ajax.php, and blocking it can stop the site from rendering or functioning properly. Because the more specific Allow rule takes precedence over the broader Disallow, that one file stays crawlable:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Looking for web hosting? KnownHost is an industry leader in fast, reliable services for any hosting need! Learn More Here!
