Thursday, July 21, 2011

XML Sitemap Guide

Google, Yahoo!, and Microsoft all support a protocol known as XML Sitemaps. Google first announced it in 2005, and Yahoo! and Microsoft agreed to support the protocol in 2006. Using the Sitemaps protocol, you can supply the search engines with a list of all the URLs you would like them to crawl and index.

Adding a URL to a Sitemap file does not guarantee that it will be crawled or indexed. However, it can help pages that the search engines would not otherwise discover get crawled and indexed. In addition, Sitemaps appear to help pages that have been relegated to Google’s supplemental index make their way into the main index.

Sitemaps are a complement to, not a replacement for, the search engines’ normal, link-based crawl. The benefits of Sitemaps include the following:
  • For the pages the search engines already know about through their regular spidering, they use the metadata you supply, such as the date the content was last modified (lastmod) and the frequency at which the page changes (changefreq), to improve how they crawl your site (see the example entry after this list).
  • For the pages they don’t know about, they use the additional URLs you supply to increase their crawl coverage.
  • For URLs that may have duplicates, the engines can use the XML Sitemaps data to help choose a canonical version.
  • Verification/registration of XML Sitemaps may indicate positive trust/authority signals.
  • The crawling/inclusion benefits of Sitemaps may have second-order positive effects, such as improved rankings or greater internal link popularity.
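
To make the metadata in the first point concrete, here is a minimal Sitemap entry in the format defined at sitemaps.org. The URL and values shown are placeholders; the lastmod, changefreq, and priority tags are all optional, and each page you want crawled gets its own url entry:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2011-07-01</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.8</priority>
    </url>
  </urlset>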

The Google engineer who goes by GoogleGuy in online forums (a.k.a. Matt Cutts, the head of Google’s webspam team) has explained Google Sitemaps in the following way: “Imagine if you have pages A, B, and C on your site. We find pages A and B through our normal web crawl of your links. Then you build a Sitemap and list the pages B and C. Now there’s a chance (but not a promise) that we’ll crawl page C. We won’t drop page A just because you didn’t list it in your Sitemap. And just because you listed a page that we didn’t know about doesn’t guarantee that we’ll crawl it. But if for some reason we didn’t see any links to C, or maybe we knew about page C but the URL was rejected for having too many parameters or some other reason, now there’s a chance that we’ll crawl that page C.”

Sitemaps use a simple XML format that you can learn about at http://www.sitemaps.org. XML Sitemaps are a useful, and in some cases essential, tool for your website. In particular, if you have reason to believe that your site is not fully indexed, an XML Sitemap can help you increase the number of indexed pages. As sites grow in size, the value of XML Sitemap files tends to increase dramatically, as additional traffic flows to the newly included URLs.
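
As a rough sketch of how a Sitemap file might be generated for a larger site, the short Python script below builds one from a list of URLs using only the standard library. The URLs, dates, change frequencies, and the sitemap.xml filename are example values, not recommendations:

  # Minimal sketch: build a Sitemap file from a list of pages.
  # The pages listed here are placeholders -- use your site's real URLs.
  import xml.etree.ElementTree as ET

  SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

  # (loc, lastmod, changefreq) -- example values only
  pages = [
      ("http://www.example.com/", "2011-07-01", "weekly"),
      ("http://www.example.com/products/", "2011-06-15", "daily"),
  ]

  urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
  for loc, lastmod, changefreq in pages:
      url = ET.SubElement(urlset, "url")
      ET.SubElement(url, "loc").text = loc
      ET.SubElement(url, "lastmod").text = lastmod
      ET.SubElement(url, "changefreq").text = changefreq

  # Write the file you will reference when submitting the Sitemap.
  ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8",
                               xml_declaration=True)

The finished file is typically placed at the root of the site and then registered and verified with each engine, as noted in the list above.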
