XML Sitemap: key recommendations for optimization

March 26, 2021

The sitemap.xml file on your site acts as a navigation aid for the pages you want Googlebot to index. It helps the bot find your main pages faster, even if your internal linking is weak.

In this article, we present recommendations for optimizing an XML Sitemap and explain why each one is worth following.

Functionalities and advantages

An XML sitemap makes the bots’ work easier and lets you get “reports” on pages and links on your site that could not otherwise be found easily.

Some of the SEO benefits are as follows:

  • faster indexing – search engines find new pages much sooner, so the site is indexed and shown in search results faster. A sitemap can also help with deindexing;
  • better indexing of internal pages – search engines can find pages that were missed when crawling the website. This does not necessarily mean that all of them will be indexed;
  • monitoring of indexed pages – in combination with Google Search Console, you can find out which of the URLs listed in the XML Sitemap Google has indexed.

Is an XML Sitemap important?

It is important for sites that:

  • do not have a good structure or do not have a good distribution of internal links;
  • have many pages – an XML sitemap helps search engines find pages that are new or updated;
  • don’t have many inbound links – the sitemap is then a good way for search engines to find your pages.

Requirements and formats

Google supports several Sitemap formats. All formats and standards can be found at this address: https://www.sitemaps.org/index.html.

All formats limit a single sitemap to 50MB (uncompressed) and 50,000 URLs. If your file is larger or contains more URLs, split it into several sitemaps and list them in a sitemap index file (described later in this article).

The main recommendations are:

  • the file must be encoded with UTF-8;
  • it must start with an opening <urlset> tag and end with a closing </urlset> tag;
  • specify the namespace (the protocol standard) in the <urlset> tag;
  • include a <url> entry for each URL as the parent tag;
  • specify the URL, starting with the protocol (https or http), in a <loc> tag, which must be a child of the <url> tag.

Additional optional attributes for XML sitemaps

Google’s documentation states that it ignores the <priority> and <changefreq> values and uses <lastmod> only when it is consistently accurate. Other search engines may still read these tags, so treat them as hints rather than instructions. They are:

  • <lastmod> – the date the file was last changed. It must be in W3C Datetime format;
  • <changefreq> – how often the page is likely to be updated. This value gives search engines general information. Valid values are always, hourly, daily, weekly, monthly, yearly, never.

                  Keep in mind that the value of this tag is a hint rather than a command. Robots may read this information, but ultimately decide for themselves whether to use it, depending on many other factors.

  • <priority> – the priority of the URL relative to other URLs on your site. Valid values range from 0.0 to 1.0.

                  Here again, keep in mind that this priority is relative and is not binding for robots. However, if you decide to give it a try, use the following guide:

    • 0.0 – 0.3: Outdated news, information that is no longer valid, but is historically useful;
    • 0.4 – 0.7: Blog articles, page categories, frequently asked questions;
    • 0.8 – 1.0: Home page, product pages, all pages with well-optimized content.

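The following example shows a Sitemap that contains only one URL and uses all the optional tags (<lastmod>, <changefreq>, and <priority>).

 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>https://netpeak.bg</loc>
     <lastmod>2018-09-15</lastmod>
     <changefreq>monthly</changefreq>
     <priority>0.8</priority>
   </url>
 </urlset>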
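
If you generate the file programmatically, a minimal Python sketch along these lines could produce the same structure. The function name build_urlset and the dictionary keys are illustrative assumptions, not part of any official tool; only the tag names and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Return sitemap XML (UTF-8 bytes) for a list of dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional. (Hypothetical helper, for illustration only.)
    """
    # Serialize the sitemap namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # <loc> is required by the protocol; the other three tags are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(entry[tag])
    # xml_declaration=True requires Python 3.8 or newer.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_urlset([
    {"loc": "https://netpeak.bg", "lastmod": "2018-09-15",
     "changefreq": "monthly", "priority": 0.8},
])
```

ElementTree escapes characters such as & in the URL text automatically, which is one reason to build the file with an XML library rather than by string concatenation.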

Identifying the important pages

Add pages that are high-quality and well optimized. Overall quality matters a great deal for ranking, and it is a factor that can give you a real advantage over the competition.

People don’t want to visit low-quality pages, and neither do Google’s bots. If you direct them to thousands of pages that are not useful to users and are not well optimized, it can only harm you. What are high-quality pages? Simply put, they are pages that:

  • have sufficient unique content;
  • quickly engage their users by prompting action (comments, reviews, etc.);
  • include images, videos, etc.;
  • do not violate Google policies.

Pages open for indexing

The crawl budget is, broadly, the number of pages a bot crawls per unit of time (day, week, month, etc.). It is therefore not advisable to waste it.

Pages that contain the “noindex” meta tag should not be added to the sitemap. Consistency matters here: the sitemap should not ask bots to index pages that the site itself marks as closed for indexing.

Run an automated check so that addresses closed for indexing are not included.

It is recommended to follow these instructions:

  • If the page https://example.com/category/product has a “noindex” robots meta tag, it should not be included in the XML map of the site;

  • When the page is blocked via robots.txt, it should not be included in the XML map:

Disallow: /category/product

Note that Google stopped supporting the unofficial “Noindex:” rule in robots.txt in September 2019, so a line such as “Noindex: /category/product” no longer keeps a page out of Google’s index.

  • If the page is closed for indexing via X-Robots-Tag in the HTTP header, it should also not be included in the XML map of the site:

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
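
The three checks above can be combined into one automated test. The following is a minimal sketch using only the Python standard library; the function name is_indexable and its parameters are assumptions made for illustration, and the page HTML, response headers, and robots.txt text are expected to have been fetched elsewhere. It looks only for the literal “noindex” directive and does not handle equivalents such as “none”.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())


def is_indexable(url, html, headers, robots_txt, user_agent="*"):
    """Return True if none of the three signals closes the page.

    Hypothetical helper: html, headers (a dict) and robots_txt (a string)
    are supplied by the caller.
    """
    # 1. Blocked by a Disallow rule in robots.txt?
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(user_agent, url):
        return False
    # 2. X-Robots-Tag: noindex in the HTTP response headers?
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return False
    # 3. <meta name="robots" content="noindex"> in the page itself?
    parser = _RobotsMetaParser()
    parser.feed(html)
    return not any("noindex" in d for d in parser.directives)
```

Only URLs for which this returns True would then be written to the sitemap.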

Canonical versions of the pages

When a single page can be reached through multiple URLs with similar content, Google will treat those URLs as duplicates.

Use the rel=“canonical” link element to tell the bot which page is the “main” one and should be crawled and indexed.

For example, if the page https://example.com/category/product-1 has a canonical pointing to https://example.com/product, then https://example.com/category/product-1 should not participate in the XML sitemap.

Automate this check: it will save you the time and headaches of manual inspections.
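
As a rough illustration of such a check, the sketch below extracts the canonical URL from a page’s HTML and compares it with the page’s own address. The helper name is_canonical is hypothetical, and treating a page with no canonical tag as self-canonical is an assumption you may want to change.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def is_canonical(url, html):
    """Return True if the page at `url` declares itself as canonical."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        # Assumption: no canonical tag means the page is its own canonical.
        return True
    # Resolve relative hrefs and ignore a trailing-slash difference.
    return urljoin(url, parser.canonical).rstrip("/") == url.rstrip("/")
```

With the example above, is_canonical("https://example.com/category/product-1", html) would be False for a page whose canonical points to https://example.com/product, so that URL would be left out of the sitemap.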

Pages that return 200 OK

Include only addresses that return a 200 OK response. Run automated checks so that addresses returning any other response (for example 404 or 301) are left out.

For example, if the page https://example.com/product returns a response other than 200 OK, it should not participate in the sitemap.

You can use the following tool to check response headers: https://soft.galinov.com/.
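
For bulk checks, a small script can do the same job. The sketch below, using only the Python standard library, sends a HEAD request without following redirects, so that a 301 is reported as 301 rather than as the status of the redirect target. The names fetch_status and keep_ok_urls are illustrative assumptions, and some servers answer HEAD differently from GET, so results may need spot-checking.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so 3xx codes surface as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for `url`, or None if unreachable."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # 3xx, 4xx and 5xx responses end up here
    except urllib.error.URLError:
        return None       # DNS failure, refused connection, etc.


def keep_ok_urls(urls, get_status=fetch_status):
    """Keep only the URLs whose status is exactly 200."""
    return [url for url in urls if get_status(url) == 200]
```

Passing the status lookup as a parameter makes it easy to test the filter without network access, or to swap in a different HTTP client.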

Pages from pagination

It is not necessary to include every paginated page in sitemap.xml. The bot can navigate from the first page of a category if the pagination is marked up properly. It is recommended to do the following:

  • include only the main pages of the categories;
  • mark the pages with rel=“next” / rel=“prev” so that robots can see the connection between them (Google announced in 2019 that it no longer uses this markup as an indexing signal, but other search engines may still read it);
  • give each paginated page a canonical pointing to itself, not to the main page. Otherwise you are telling the bot “It doesn’t matter that I have 5,000 products and 20 pages, they are the same as the first one.”

For example, the page https://example.com/category/page-2 should not participate in the map.

Minimize the file size

Google and Bing raised the sitemap file size limit from 10MB to 50MB in 2016, but it is still good practice to keep your Sitemap as small as possible.

This is rarely something to worry about, but if your sitemap contains more than 50,000 URLs or exceeds 50MB uncompressed, it must be split into several XML maps. In that case, list all of the XML maps in a separate sitemap index file.

What is an XML Sitemap index file

You can submit multiple Sitemap files, but each file must comply with the above rules. If you want, you can compress the files using gzip to save bandwidth, but each sitemap must still be no larger than 50MB once uncompressed.

The XML format of the index file is very similar to the normal sitemap format. It must contain:

  • an opening <sitemapindex> tag and a closing </sitemapindex> tag;
  • a <sitemap> entry for each Sitemap as the parent tag;
  • a <loc> tag as a child of each <sitemap> tag.

The optional <lastmod> tag may also be included.

Note: The Sitemap index file can only list maps that are on the same site. For example, https://example.com/sitemap_index.xml may include maps at https://example.com, but not at https://www.saitprimer.com or https://www.example.com.

As with all other files, the index file must be encoded with UTF-8.

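The following example shows a Sitemap index that lists two maps:

 <?xml version="1.0" encoding="UTF-8"?>
 <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
     <loc>http://www.example.com/sitemap1.xml.gz</loc>
     <lastmod>2018-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
     <loc>http://www.example.com/sitemap2.xml.gz</loc>
     <lastmod>2017-01-01</lastmod>
   </sitemap>
 </sitemapindex>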
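
Splitting a large URL list and writing the index can be automated. The sketch below is one possible approach in Python; chunk_urls and build_index are hypothetical helper names, and the sitemap file names are made up for the example. It enforces only the 50,000-URL limit, so the 50MB size limit would need a separate check.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-sitemap URL limit from the protocol


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls, lastmod=None):
    """Return sitemap index XML (UTF-8 bytes) listing the given sitemaps."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            # Optional tag; must be in W3C Datetime format.
            ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Example: 120,001 URLs need three sitemap files.
urls = [f"https://example.com/p/{i}" for i in range(120_001)]
chunks = chunk_urls(urls)
names = [f"https://example.com/sitemap{n + 1}.xml" for n in range(len(chunks))]
index_xml = build_index(names, lastmod="2018-10-01")
```

Each chunk would then be written to its own sitemap file under the name listed in the index.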

Description of the mobile version

We need to help Googlebot find our content and understand the connection between the desktop and mobile pages. For each desktop page in the XML sitemap, add a rel="alternate" annotation pointing to its mobile version, as follows:

 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <url>
     <loc>http://www.example.com/page-1/</loc>
     <xhtml:link
         rel="alternate"
         media="only screen and (max-width: 640px)"
         href="http://m.example.com/page-1" />
   </url>
 </urlset>

Keep in mind that each desktop page needs to correspond to exactly one page of the mobile version. It is not recommended, for example, to link several desktop pages via rel=“alternate” to a single mobile page, or several mobile pages to a single desktop page.

You must also check for redirects. The desktop page should correspond to the same content in the mobile version and must not redirect to a different page.
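
The one-to-one rule is easy to verify automatically before the sitemap is written. The following Python sketch assumes you already have a list of (desktop URL, mobile URL) pairs; the function name find_mapping_conflicts and the result keys are illustrative only.

```python
from collections import defaultdict


def find_mapping_conflicts(pairs):
    """Report desktop/mobile URLs that break the one-to-one rule.

    `pairs` is an iterable of (desktop_url, mobile_url) tuples.
    """
    by_desktop = defaultdict(set)
    by_mobile = defaultdict(set)
    for desktop, mobile in pairs:
        by_desktop[desktop].add(mobile)
        by_mobile[mobile].add(desktop)
    return {
        # Desktop pages annotated with more than one mobile page.
        "desktop_with_many_mobile": {
            d: sorted(m) for d, m in by_desktop.items() if len(m) > 1
        },
        # Mobile pages shared by more than one desktop page.
        "mobile_with_many_desktop": {
            m: sorted(d) for m, d in by_mobile.items() if len(d) > 1
        },
    }
```

Both dictionaries in the result should be empty before the annotations are published.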

How bots can find your XML Sitemap

Once you have automated the process and uploaded the sitemap to your server (or generated it with a plugin), you need to tell bots where to find it.

The best way is to include a link to it in your robots.txt file. This is called Sitemap Discovery, and Google, Bing, and Yahoo introduced it back in 2007 to help their robots find XML Sitemaps.

All you have to do is include the full path to your map or index file.
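
To confirm that the reference is in place, you can parse robots.txt the way a crawler would. The sketch below uses Python’s standard urllib.robotparser (its site_maps() method requires Python 3.8 or newer); the robots.txt content, including the /cart/ rule and the sitemap URL, is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap line giving the full URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap_index.xml
"""


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


sitemaps = declared_sitemaps(ROBOTS_TXT)
```

An empty list here means bots have no hint in robots.txt and will only learn about the sitemap if you submit it directly.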

Correct transliteration of addresses

The official Google documentation (Build and submit a Sitemap) emphasizes that all data values (including URLs) must contain only ASCII characters. They cannot contain control codes or special characters such as * or {}.

If your site’s URLs contain these characters, you’ll get an error when you try to submit the sitemap, so such URLs must be escaped first.
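
One way to handle this is to percent-encode the path and query of each URL and then escape XML entities such as &. The Python sketch below shows the idea under some simplifying assumptions: the helper name sitemap_safe_url is hypothetical, the host name is left untouched (internationalized domains would need separate handling), and any fragment is dropped.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def sitemap_safe_url(url):
    """Percent-encode a URL and XML-escape it for use inside <loc>."""
    parts = urlsplit(url)
    # Keep "/" and existing "%" escapes; encode everything else that is
    # not an unreserved character (this covers *, { and } as well).
    path = quote(parts.path, safe="/%")
    # In the query string, also keep the "=" and "&" separators.
    query = quote(parts.query, safe="=&%")
    ascii_url = urlunsplit((parts.scheme, parts.netloc, path, query, ""))
    # Replace &, < and > with XML entities.
    return escape(ascii_url)


safe = sitemap_safe_url("https://example.com/ümlat.php?q=a&b=1")
# safe == "https://example.com/%C3%BCmlat.php?q=a&amp;b=1"
```

If you build the file with an XML library that escapes text for you, apply only the percent-encoding step to avoid escaping & twice.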

Submit your map to Google

You can submit your sitemap to Google via Google Search Console.

Check for any errors before submitting. It’s important to clear up any errors that may be an obstacle to indexing key landing pages.

Ideally, the number of indexed pages should be equal to the number of pages submitted.

Conclusion

  1. Be consistent – if a page is blocked by robots.txt or by “noindex”, it should not be in your XML map.
  2. Automate your process – all of the above recommendations can be automated, which saves you time, keeps your crawl budget from being wasted, and spares you a lot of headaches.
  3. If you have a very large site, use an index file with several maps. This saves server time and covers all the important pages on your site.

About the author

Martin Zhelyazkov is an SEO specialist at the performance marketing agency Netpeak. His years of experience in SEO are focused mainly on online stores, web portals, and corporate sites.

Netpeak Agency is one of the largest performance marketing agencies in Central and Eastern Europe, part of the Netpeak Group, with more than 3,000 successful projects.
