Technical SEO: A Simple Guide to Crawling, Indexing, and Ranking

All you have to do in technical SEO is make sure the right pages and resources are discovered, crawled, indexed, and ultimately ranked. Easy, right?

Wednesday, August 23, 2023
The SEO pyramid showing that technical SEO is the foundation of ranking in search engines

SEO is complex, but at its core there are three essential pillars: Crawling, Indexing, and Ranking. Understanding these concepts is actually one of the most straightforward parts of SEO, and technical SEO is arguably the most important part too. If you enjoyed our guide on 7 things to do to get your content indexed, you'll enjoy this too!

Let's walk through Crawling, Indexing, and Ranking (AKA Technical SEO), and why you should care. Each of them has a section below, and then we'll wrap things up with a cheat sheet.

The 3 Simple Phases of Technical SEO

1. Crawling: “The Discovery Phase”

Crawling is the process of search engines discovering and reading a website's pages. Various bots, such as smartphone (mobile) bots, desktop bots, AdsBot, and Googlebot Image, are used to explore and read a site.

  • Why you should care: If a page never makes it through this step, it will never show up in Google Search.
  • Easy way to see if you might have crawling issues: Open your Google Search Console property and navigate to Pages, which you'll find in the "Indexing" section. The line item to look for is "Discovered - currently not indexed". This means that Google knows about the pages listed here but hasn't crawled them yet. This doesn't necessarily mean there's a crawling problem, but it's where you should start your investigation.
how to navigate to page indexing report in Google Search Console
There are many more checks you should do, especially if you're dealing with a large website, but this method is great for beginners.

Things that influence crawling

  • Robots.txt Directives: Communicate which pages we want crawled.
  • HTML Standards (e.g. semantic HTML): Ensure links are seen and can be crawled.
  • IP blocking: For example, Google crawls from many different IP addresses based all around the world, so if you’re trying to prevent traffic based on IP address, make sure you’re not blocking the important bots from crawling.
  • Internal Linking: Plain links help search engines discover URLs, while adding a rel="nofollow" attribute to a link can discourage crawling of the page it points to.
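
To make the robots.txt point concrete, here's a minimal sketch using Python's built-in urllib.robotparser. The rules and the example.com URLs are made up for illustration, and Python's parser won't necessarily match every rule the way Google's does, so treat it as a quick sanity check rather than the final word.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post"))
print(is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/cart/checkout"))
```

Running a list of your most important URLs through a check like this is a cheap way to catch an accidental Disallow before it costs you traffic.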

Possible crawling issues you might encounter

  • Website Errors: most commonly 403, 404 or 410 status codes 
  • Server Errors: Too many 5xx status codes
  • JS & CSS loading issues: These are very common. If scripts or stylesheets are blocked or fail to load, Google may not see the page the way your users do.
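
If you have a crawl export handy, the first two issue types are easy to tally. The sketch below assumes a list of (URL, status code) pairs; the sample data and the bucket names are invented for illustration and don't come from any particular crawler.

```python
from collections import Counter


def classify_status(code: int) -> str:
    """Bucket an HTTP status code into the issue types listed above."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "website error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def summarize_crawl(results):
    """Count how many crawled URLs fall into each bucket."""
    return Counter(classify_status(code) for _url, code in results)


# Hypothetical crawl results: (URL, status code) pairs.
SAMPLE_RESULTS = [
    ("https://example.com/", 200),
    ("https://example.com/old-page", 404),
    ("https://example.com/private", 403),
    ("https://example.com/api/report", 503),
]

summary = summarize_crawl(SAMPLE_RESULTS)
print(summary["website error"], summary["server error"])
```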

2. Indexing: “The Organization Phase”

Indexing is the term we use to describe the process of adding (or not adding) a web page to a search engine's database. It only occurs after a page is crawled.

  • Why you should care: If a page never makes it through this step, it will never show up in Google Search.
  • Easy way to see if you might have indexing issues: Open your Google Search Console property and navigate to Pages, which you'll find in the "Indexing" section. The line item to look for is "Crawled - currently not indexed". This means that Google has crawled the pages listed here but hasn't indexed them yet. This doesn't necessarily mean there's an indexing problem, but it's where you should start your investigation.
Quick way to check for potential indexing issues in Google Search Console

Indexing Tools and Techniques

  • Meta Robots Tags: Tell search engines whether you want a page indexed. The tag is implemented in the head of each page's HTML.
  • Canonical Tags: Help keep duplicate content from getting indexed.
  • XML Sitemap file(s): Tell Google which pages you want indexed.
  • RSS Feed(s): A quick way to tell crawlers about new content. Publishers, take note!
  • Helpful Content: Search engines try to filter out the noise and have systems in place to help detect helpful content versus content that only exists for SEO.
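
Here's a rough sketch of how you might read the first two signals off a page yourself, using only Python's standard library. The sample HTML is invented for illustration, and the class and function names are just ones picked for this example.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collects the meta robots content and canonical URL from a page."""

    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")


def read_indexing_signals(html: str) -> dict:
    """Return whether the page asks for noindex and which canonical URL it declares."""
    parser = IndexingSignals()
    parser.feed(html)
    directives = [d.strip().lower() for d in (parser.robots or "").split(",")]
    return {"noindex": "noindex" in directives, "canonical": parser.canonical}


# Hypothetical page source, for illustration only.
SAMPLE_HTML = """
<html><head>
  <meta name="robots" content="noindex, follow">
  <link rel="canonical" href="https://example.com/shoes">
</head><body><p>Hello</p></body></html>
"""

print(read_indexing_signals(SAMPLE_HTML))
```

This only looks at the raw HTML. If your tags are injected with JavaScript, the rendered page may say something different.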

Common Indexing Issues to Look for

  • Important Pages May Be No-Indexed: Check manually and confirm in Google Search Console by looking for "Excluded by 'noindex' tag" in the Pages report.
  • Crawling Issues: Remember that if a page hasn't been crawled, then it absolutely won't be indexed. Refer back to section 1.
  • JS & CSS Loading Issues: If the resources on the page aren't being crawled, but the page is, then maybe the crawler can't see your content. This is most common with JavaScript websites.
  • Bad Content: Maybe your content just sucks and you need to work on making it better. Just kidding, your content doesn't suck, but I bet you can make it better.

How to Check a URL's Google Indexation Status

  • Manually: Search for site:{{URL}} in the search engine's search bar, with no space after the colon.
  • Google Search Console: Throw any URL into the top search bar in Google Search Console and hit enter on your keyboard. Within seconds you'll get the status for that specific URL. There's also a ton of other helpful information in this report that we will not cover in this post.
  • Mobile Friendly Tool: Test a live URL to see if there might be loading/crawling issues for the CSS, JavaScript, and Images on a page
First step to check URL indexation in Google Search Console
Step 1: Paste page URL into the top bar in Google Search Console

Second step to check URL indexation in Google Search Console
Step 2: See if the URL is indexed or not
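
If you run the manual site: check often, a tiny helper can build the search link for you. This is only a sketch: the function name is made up, and it does nothing more than produce a Google search URL you can open in a browser.

```python
from urllib.parse import urlencode


def site_query_url(page_url: str) -> str:
    """Build a Google search link for a site: query on the given URL."""
    # No space between "site:" and the URL, otherwise the operator is not applied.
    return "https://www.google.com/search?" + urlencode({"q": f"site:{page_url}"})


print(site_query_url("https://example.com/blog/post"))
```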

3. Ranking: Showing Up in Search Results

Ranking is the term we use to describe the process of determining the order in which indexed pages appear in search results. SEOs want to attract the correct traffic, convert that traffic to leads, and then convert leads to sales.

Ranking Tools and Techniques

  • Disallowing Pages from Being Crawled: Helps keep unnecessary pages from being indexed, and thus from ranking.
  • Redirects: Directs users to other pages via JS or 3xx codes.
  • UX and Accessibility: Affects ranking through server configs, caches, page speed, core web vitals, mobile friendliness, SSL, and more.

Pages You May Not Want Ranked

  • Robots.txt: Disallows crawling of entire page types.
  • Meta-Robots Tags: Allow noindex directives on entire pages.
  • HTTP Status Codes: Serving a 4xx or 5xx status code signals that the page should not be ranked.

Note: There is so much beyond technical SEO that goes into ranking, like having helpful content, high quality content, and external signals like backlinks.

Recap: Steps Before Your Page Shows in Search

  1. Discover: Via links on the site, other sites, or XML sitemap.
  2. Crawl: Looks at the source HTML of the page.
  3. Index: If deemed helpful, adds the page to the growing collection.
  4. Rank: Determines the visibility in search results.
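
One way to use this recap when debugging is to ask which step a page gets stuck at first. The sketch below is a deliberately simplified model: the PageSignals fields are assumptions made for this example, and real indexing decisions also depend on things like content quality that a few booleans can't capture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSignals:
    discovered: bool     # linked from somewhere or listed in an XML sitemap
    crawl_allowed: bool  # not disallowed in robots.txt
    status_code: int     # HTTP status code returned to the crawler
    noindex: bool        # meta robots noindex is present


def first_blocker(page: PageSignals) -> Optional[str]:
    """Return the first step that stops the page, or None if nothing obvious does."""
    if not page.discovered:
        return "discover"
    if not page.crawl_allowed or page.status_code >= 400:
        return "crawl"
    if page.noindex:
        return "index"
    return None


print(first_blocker(PageSignals(True, True, 200, True)))
```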

How Google Search Works, according to Google
How Google Says it Works. Source: Google

Google also renders content to try to see the page as a user would, but that's a topic for a different day.

The full discovery to indexation process that Google uses
Here's how Google crawls and indexes content, but if you start with the 3 steps we discuss in this article, you'll be off to a great start!

Beginner Technical SEO FAQs

What's the difference between crawling and indexing?

Crawling is the discovery phase, while indexing is the organization phase. Crawling finds the pages, and indexing adds them to the search engine's database.

How can I control which pages are crawled or indexed?

You can use tools like robots.txt, meta robots tags, and canonical tags to control crawling and indexing.

Why is my page not ranking well?

It could be due to issues with crawling, indexing, content quality, or other ranking factors like page speed, mobile-friendliness, and internal linking.

Technical SEO Cheat Sheet

Understanding and implementing the right techniques for crawling and indexing are table stakes for SEO. As promised, here's a cheat sheet to guide you through the process:

Crawling

To Discourage Crawling

1. Robots.txt directives:

  • This method is good for: Discouraging search engine bots from crawling specific pages or directories.
  • This method is not good for: Preventing access to sensitive data, as some bots might ignore the directive.
  • Other things to consider: Check that essential pages (like the homepage) are not accidentally blocked.

2. <a> attributes (e.g., nofollow):

  • This method is good for: Discouraging search engines from following specific links.
  • This method is not good for: Preventing a page from being indexed.
  • Other things to consider: The nofollow attribute does not guarantee that the linked page won't be indexed.

3. Avoid <a> tags (Use <button> or JS onclick):

  • This method is good for: Preventing search engines from recognizing links.
  • This method is not good for: User experience if not implemented correctly.
  • Other things to consider: Ensure that essential navigation remains accessible and user-friendly.

4. Exclude from XML sitemap:

  • This method is good for: Making a URL less likely to be discovered. Google can't crawl a URL if it doesn't know about it!
  • This method is not good for: Pages you want to be discovered and indexed.
  • Other things to consider: Regularly check your sitemap files to make sure that you're communicating the correct URLs.
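
To see why methods 2 and 3 behave differently, here's a small sketch of a link extractor that only looks at <a> tags, the way a basic crawler might. The sample markup and names are invented for illustration. A real search engine can still find a URL some other way (for example by rendering JavaScript or through a sitemap), so none of this is a guarantee.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) pairs from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # <button> and other elements are never treated as links here
        attrs = dict(attrs)
        href = attrs.get("href")
        if href:
            rel_values = (attrs.get("rel") or "").lower().split()
            self.links.append((href, "nofollow" in rel_values))


def collect_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical navigation markup, for illustration only.
SAMPLE_NAV = """
<a href="/pricing">Pricing</a>
<a href="/login" rel="nofollow">Log in</a>
<button onclick="location.href='/cart'">Cart</button>
"""

print(collect_links(SAMPLE_NAV))
```

The nofollow link is still found, just flagged, while the button never shows up as a link at all.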

To Prevent Crawling

1. Password Protection:

  • This method is good for: Securing sensitive content from unauthorized or unwanted access.
  • This method is not good for: Public pages that you want users and search engines to access freely.
  • Other things to consider: Ensure that the login mechanism is user-friendly and secure.

2. 4xx Status Codes: Serving a 4xx status code keeps both search engines and people from seeing the page.

To Encourage Crawling

1. Include in XML sitemap:

  • This method is good for: Signaling search engines to crawl specific URLs.
  • This method is not good for: URLs that you don't want to be discovered.
  • Other things to consider: Ensure the sitemap is valid and submitted to search engines.

2. Link Building (Onsite & Offsite):

  • This method is good for: Improving the visibility and discoverability of a URL.
  • This method is not good for: URLs you don't want search engines to know about.
  • Other things to consider: Prioritize quality (relevance) over quantity when building links.
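
If your CMS doesn't generate a sitemap for you, a basic one is simple to build. Here's a minimal sketch using Python's standard library; the URLs are placeholders, and it only writes the required loc element for each URL.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a minimal XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs, for illustration only.
print(build_sitemap(["https://example.com/", "https://example.com/blog/"]))
```

Only list URLs you actually want crawled and indexed, since the sitemap is where you tell search engines what matters.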

Indexing

To Discourage Indexing

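Tip: Where you can, prevent indexing at the crawling stage. If a URL cannot be crawled, its content cannot be indexed! One caveat: a URL that is only disallowed in robots.txt can still show up in results without its content if other pages link to it, and a noindex tag only works if the page can be crawled so the tag can be seen.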

1. Meta Robots Tag (noindex):

  • This method is good for: Instructing search engines not to index specific pages.
  • This method is not good for: Pages you want to appear in search results.
  • Other things to consider: Ensure that essential pages are not accidentally set to noindex.

2. Canonicalization:

  • This method is good for: Pointing search engines to the preferred version of a page.
  • This method is not good for: Pages with unique content that should be indexed separately.
  • Other things to consider: Ensure canonical tags are correctly implemented and reflect the page version you want indexed.

To Prevent Indexing

1. GSC URL Removal Tool:

  • This method is good for: Quickly removing a URL from Google's index.
  • This method is not good for: Permanent removal of a URL.
  • Other things to consider: The removal is temporary, and the URL may be reindexed in the future.

2. Password Protection (before indexing):

  • This method is good for: Preventing a page from being crawled, and thus from being indexed.
  • This method is not good for: Public pages meant for user access.
  • Other things to consider: Ensure that the login mechanism doesn't hinder user access and is secure.

To Encourage Indexing

1. Canonicalization (for duplicates):

  • This method is good for: Consolidating ranking signals to a single, preferred URL.
  • This method is not good for: Unique pages that should be indexed separately.
  • Other things to consider: Avoid creating multiple pages with near-identical content.

2. Self-referencing Canonical Tag:

  • This method is good for: Reinforcing the canonical status of a URL.
  • This method is not good for: Pages that are duplicates of other content.
  • Other things to consider: Ensure the tag correctly references the current URL.

3. Internal & External Links:

  • This method is good for: Enhancing the visibility and authority of a URL.
  • This method is not good for: Over-optimization or spammy link-building practices.
  • Other things to consider: Focus on acquiring high-quality, relevant links.

4. Quality Content:

  • This method is good for: Attracting organic traffic and improving rankings.
  • This method is not good for: Thin or duplicated content.
  • Other things to consider: Regularly update content to keep it fresh and relevant. Don’t veer too far from your topic(s) of expertise.
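
For the two canonical methods above, a quick check is whether a page's canonical tag points at the page itself or somewhere else. The sketch below assumes you already have the page URL and its canonical URL. The normalization is intentionally basic (lowercased scheme and host, fragment ignored), and the status labels are made up for this example.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize(url: str) -> tuple:
    """Reduce a URL to the parts compared here, ignoring any #fragment."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query)


def canonical_status(page_url: str, canonical_url: Optional[str]) -> str:
    """Describe how a page's canonical tag relates to the page's own URL."""
    if not canonical_url:
        return "missing"
    if normalize(page_url) == normalize(canonical_url):
        return "self-referencing"
    return "points elsewhere"


print(canonical_status("https://example.com/shoes?color=red", "https://example.com/shoes"))
```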

Happy SEO-ing, everyone!

<div class="post-note-cute"><strong>Why should you listen to Momentic about technical SEO?</strong> We have been doing advanced technical SEO at scale since 2014. We are routinely considered a top technical SEO agency in the world by aggregators, such as Clutch.co (and our clients and connections!). If you have questions about anything SEO related, don't hesitate to reach out: info@momenticmarketing.com</div>
