2. CRAWLING. INDEXING. RANKING – THE THREE MUSKETEERS OF SEO
The crawling phase is all about discovery. The process is really complicated and uses software programs called spiders (or web crawlers). Googlebot is, maybe, the most popular crawler.
The crawlers start by fetching web pages and then follow the links on the page, fetch those pages and follow the links on those pages and so on, up to the point where pages are indexed. For this method, the crawler uses a parsing module, which does not render pages but only analyzes the source code and extracts any URLs found in the <a href=”…”> script. Crawlers can validate hyperlinks and HTML code.
You can help Google and tell the crawler which pages to crawl and which not to crawl. A “robots.txt” file tells search engines whether they can access and crawl your site or just some parts. Using this method, you give Googlebot access to the code data. You should use the robots.txt file to show Google exactly what you want your user to see, because otherwise, you may have pages that will be accessed and don’t want to be indexed. Using this tool, you’ll be able to block or manage various crawlers. Check your robots.txt file to avoid errors and ranking drops. Nowadays, most robots.txt files include the XML sitemap address that increases the crawl speed of bots, which comes as an advantage for your website.
In the crawling process, Googlebot has the main role. On the other side, in the indexing process, Caffeine is indexing infrastructure and has the main role.
Practically, these two phases work together:
- The crawler sends what it finds to the indexer;
- The indexer feeds more URLs to the crawler. And as a bonus, it prioritizes the URLs based on their high value.
The whole concept of the relationship between crawl and index is very well explained by Matt Cutts in the “How Search Works” video:
Once this stage is complete and no errors are found in the Search Console, the ranking process should begin. At this point, the webmaster and SEO experts must put effort into offering quality content, optimizing the website, earning and building valuable links following the quality guidelines from Google. Also, it is very important that the people responsible for this process be informed of the Rater Guidelines.
All the problems began when people started confusing Googlebot (used in the crawling process) with Caffeine (used in the indexing process). Barry Adams talked about the confusion between these two. There’s even a thread on Twitter about it:
It has lots of guide on how search engine optimization works, how developers should design websites and how content writers should create white-hat content. That is how the crawl budget term took birth.
In 2015, 6 years later, Google deprecated their AJAX crawling system and things have changed. The Technical Webmaster Guidelines show that they’re not blocking Googlebot from crawling JS or CSS files and they manage to render and understand web pages.
And there were other problems that needed to be solved. Some webmasters that were using JS framework had web servers that served a pre-rendered page, which shouldn’t normally happen. Pre-rendering pages should follow the progressive enhancement guidelines and have benefits for the user. In another case, it is very important that the content sent to Googlebot matches the content served to the user, both how it looks and how it interacts. Basically, when Googlebot crawls the page, it should see the same content the user sees. Having different content means cloaking, and it is against Google’s quality guidelines.
The progressive enhancement guidelines say that the best approach for building a site’s structure is to use only HTML, and after that play with AJAX for the appearance and interface of the website. In this case, you are insured, because Googlebot will see the HTML and the user will benefit from the AJAX looks.
Google confirmed another change that reflects AJAX. It started with the decision of deprecating their AJAX crawling system, and Roey Skif asked John Mueller on Twitter about the Fetch as Google the hash bang URLs. Then he tested the impact of this change. He saw a lot of blocked resources that were completely different on the hashbang URLs, and that wasn’t aware of them.
It is true, now Google is supporting hashbang URLs, URLS that have the #! in them in (it stopped doing that in March 30, 2014). This is an example of a link of such: http://www.example.com/bla/#!/bla/. The nice part is you can use Fetch as Google for AJAX hash bang’s URLs.
Another thing you could do, besides using Fetch as Google, is to check and test your robots.txt file from the Search Console, too. The Google Webmaster Tool robots tester allows you to check each line and see each crawler and what access it has on your website. If you take a look at the next screenshot you can see how it works:
The information retrieval process includes crawling, indexing and rankings. You surely heard of them before, but what you didn’t know is that lots of people are confused on how crawling and indexing work together and what each process does. We’ve seen that in the crawling phase the website is fetched, then in the indexing phase the site is rendered. Googlebot (the crawler) fetches the website and Caffeine (the indexer) renders the content. The problem started here when most people confused these two and said that the crawler helps Google to index the website.
Upwork is a freelancing website where businesses of all sizes can find talented independent professionals across multiple disciplines and categories. If you’re a business and are looking to get projects done, consider signing up!
This article was submitted by Andreea Sauciuc and originally appeared on CognitiveSEO. It has been republished with permission and does not constitute the views or opinions of Upwork. Find out how to publish your content with Upwork.