Friday, June 29, 2012

Difference Between HTML Sitemap and XML Sitemap

HTML Sitemap vs. XML Sitemap

An HTML sitemap allows site visitors to easily navigate a website. It is a bulleted outline text version of the site navigation. The anchor text displayed in the outline is linked to the page it references. Site visitors can go to the Sitemap to locate a topic they are unable to find by searching the site or navigating through the site menus.

This Sitemap can also be created in XML format and submitted to search engines so they can crawl the website in a more effective manner. Using the Sitemap, search engines become aware of every page on the site, including any URLs that are not discovered through the normal crawling process used by the engine. Sitemaps are helpful if a site has dynamic content, is new and does not have many links to it, or contains a lot of archived content that is not well-linked.
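To make this concrete, a minimal XML Sitemap is just a list of `<url>` entries inside a `<urlset>`. The URLs and values below are placeholders; only `<loc>` is required for each entry:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2012-06-29</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/store</loc>
  </url>
</urlset>
```

Save this as sitemap.xml at the site root and submit it to the search engines (for example, through Google Webmaster Tools).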

Which is better: an HTML sitemap or an XML Sitemap? In practice, you want both – the HTML sitemap helps your visitors navigate the site, while the XML Sitemap helps search engines crawl it.



Thursday, June 28, 2012

How to Define Canonical in your Page



It sounds like an easy question, doesn’t it? While we hear a lot about duplicate content since the Panda update(s), I’m amazed at how many people are still confused by a much more fundamental question – which URL for any given page is the canonical URL? While the idea of a canonical URL is simple enough, finding it for a large, data-driven site isn’t always so easy. This post will guide you through the process with some common cases that I see every week.

Let’s Play Count the Pages

Before we dive in, let’s cover the biggest misunderstanding that people have about “pages” on their websites. When we think of a page, we often think of a physical file containing code (whether it’s static HTML or script, like a PHP file). To a crawler, a page is any unique URL that it finds. One file could theoretically generate thousands of unique URLs, and every one of those is potentially a “page” in Google’s eyes.
It’s easy to smile and nod and all agree that we understand, but let’s put it to the test. In each of the following scenarios, how many pages does Google see?

(A) “Static” Site

  • www.example.com/
  • www.example.com/store
  • www.example.com/about
  • www.example.com/contact

(B) PHP-based Site

  • www.example.com/index.php
  • www.example.com/store.php
  • www.example.com/about.php
  • www.example.com/contact.php

(C) Single-template Site

  • www.example.com/index.php?page=home
  • www.example.com/index.php?page=store
  • www.example.com/index.php?page=about
  • www.example.com/index.php?page=contact
The answer is (A) 4, (B) 4, and (C) 4. In Google’s eyes, it doesn’t matter whether the pages have extensions (“.php”), the home-page is at the root (“/”) or at index.php, or even if every page is being driven off of one physical template. There are four unique URLs, and that means there are four pages. If Google can crawl them all, they’ll all be indexed (usually).
Let’s dive right into a few examples. Please note: these are just examples. I’m not recommending any of the URL structures in this post as ideal – I’m just trying to help you determine the correct canonical URL for any given situation.

Case 1: Tracking URLs

I’ll start with an easy one. Many sites still use URL parameters to track visitor sessions or links from affiliates. No matter what the parameter is called or which purpose it’s used for, it creates a duplicate for every individual visitor or affiliate. Here are a few examples:
  1. www.example.com/store.php?session=1234
  2. www.example.com/store.php?affiliate=5678
  3. www.example.com/store.php?product=1234&affiliate=5678
In the first two examples, the session and affiliate ID create a copy, in essence, of the main store page. In both of these cases, the proper canonical URL is simply:
  • www.example.com/store.php
The last example is a bit trickier. There, we also have a “product=” parameter that drives the product being displayed. This parameter is essential – it determines the actual content of the page. So, only the “affiliate=” parameter should be stripped out, and the canonical URL is:
  • www.example.com/store.php?product=1234
This is just one of many cases where the canonical URL is NOT the root template or the URL with no parameters. Canonical URLs aren’t always short or pretty – many canonical URLs will have parameters. Again, I’m not arguing that this structure is ideal. I’m just saying that the canonical URL in this case would have to include the “product=” parameter.
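The tag itself is simple. On every tracked variant of the store page (the session and affiliate URLs above), the `<head>` would declare the clean URL as canonical:

```html
<!-- In the <head> of www.example.com/store.php?session=1234,
     www.example.com/store.php?affiliate=5678, and any other
     tracked variant of the same page: -->
<link rel="canonical" href="http://www.example.com/store.php" />
```

For the product example, the canonical href would instead be the URL including the "product=" parameter, since that parameter drives the content.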

Case 2: “Dynamic” URLs

Unfortunately, the word “dynamic” gets thrown around a little too freely – for the purposes of this blog post, I mean any URLs that pass variables to generate unique content. Those variables could look like traditional URL parameters or be embedded as “folders”.
A good example of the kind of URLs I’m talking about are blog post URLs. Take these four:
  1. www.example.com/blog/1234
  2. www.example.com/blog.php?id=1234
  3. www.example.com/blog.php?id=1234&comments=on
  4. www.example.com/blog/20120626
Again, it doesn’t matter whether the URLs have parameters or hide those parameters as virtual folders. All of these URLs use a unique value (either an ID or date) to generate a specific blog post. So what’s the canonical URL here? Obviously, if you canonicalize to “/blog”, you’re going to reduce your entire blog to one page. It’s a bit of a trick question, because the canonical URL could actually be something like this:
  • www.example.com/blog/this-is-a-blog-post
This is why we have such a hard time detecting the proper canonical URLs with automated tools – it really takes a deep knowledge of a site’s architecture and the builder’s intent. Don’t make assumptions based on the URL structure. You have to understand your architecture and crawl paths. If you just start stripping off URL parameters, you could cause an SEO disaster.
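As a sketch of what “stripping parameters safely” means, here is a minimal Python example. The whitelist of content-driving parameters is hypothetical – it has to come from knowledge of your own site’s architecture, which is exactly the point above:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical whitelist: parameters that actually drive page content.
# Tracking parameters (session, affiliate, etc.) are dropped; guessing
# this list instead of knowing your architecture is how disasters happen.
CONTENT_PARAMS = {"product", "id"}

def canonicalize(url):
    """Return the URL with non-content query parameters stripped."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in CONTENT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonicalize("http://www.example.com/store.php?product=1234&affiliate=5678"))
# http://www.example.com/store.php?product=1234
```

The "product=" parameter survives because it determines the content; everything else is treated as tracking noise.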

Case 3: The Home-page

It might seem strange to put the home page third, but the truth is that the first two cases were probably easier. Part of the problem is that home pages naturally spin out a lot of variations:
  1. www.example.com
  2. www.example.com/
  3. www.example.com/default.html
  4. www.example.com/index.php
  5. www.example.com/index.php?page=about
Add in complications like secure pages (https:), and you can end up multiplying all of these variants. While this is technically true of any page, the problem tends to be more common for the home page, since it’s usually the most linked-to page (both internally and from external sites) by a large margin.
In most cases, the technically correct home-page URL is:
  • http://www.example.com/
…but there are exceptions (such as if you secure your entire site). I don’t see the trailing slash (“/”) causing a ton of problems on home pages these days, since most browsers and crawlers add it automatically, but I think it’s still a best practice to use it.
Another common exception is if your site automatically redirects to another version of the home-page – ASP is notorious about this, and often lands visitors and bots at “index.aspx” or a similar page. While that situation isn’t ideal, you don’t want to cross signals. If the redirect is necessary, then the target of that redirect (i.e. the “index.aspx” URL) should be your canonical URL.
Finally, be very careful about situation #5 – in that case, as I discussed in the first section of this post, the “index.php” code template is actually driving other pages with unique content. Canonicalizing that to the root or to “index.php” could collapse your site to one page in the Google index. That particular scenario is rare these days, but some CMS systems still use it.
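For the common home-page variants, a redirect is usually a stronger fix than a tag. As a rough sketch, an Apache .htaccess file might collapse them like this – these rules are hypothetical, and they assume “index.php” really is only the home page, not a template driving other pages as in situation #5:

```apache
RewriteEngine On

# 301 the non-www host to the www version (hypothetical domain).
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 301 the index-file variants to the root URL.
RewriteRule ^index\.(php|html)$ http://www.example.com/ [R=301,L]
```

Test rules like these carefully on a staging copy first – a bad redirect here affects your most-linked page.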

Case 4: Product Pages

In some ways, product pages are a lot like the blog-post pages in Case #2, only more so. You can naturally end up with a lot of variations on an e-commerce site, including:
  1. www.example.com/store.php?id=1234
  2. www.example.com/store/1234
  3. www.example.com/store/this-is-a-product
  4. www.example.com/store.php?id=1234&currency=us
  5. www.example.com/store/1234/red
  6. www.example.com/store/1234/large
If you have a URL like #3, then that’s going to be your canonical URL for the product in most cases – URLs #1 and #2 should point to it. If you don’t have #3, then work up the list: use #2; if you don’t have #2, use #1. You have to work with the structure you have.
URLs #4-#6 are a bit trickier. Something like the currency selector in #4 can be very complicated and depends on how those selections are implemented (user selection vs. IP-based geo-location, for example). For Google’s purposes, you would typically want them to see the dominant price for the site’s audience and canonicalize to the main product URL (#1-#3, depending on the site architecture). Indexing every price variant, unless you have multiple domains, is just going to make your content look thinner.
With #5 and #6, the URL indicates a product variant, let’s say a T-shirt that comes in different colors and sizes. This situation depends a lot on the structure and scope of the content. Technically, your T-shirt in red/large is unique, and yet that page could look “thin” in Google’s eyes. If you have a variant or two for a handful of products, it’s no big deal. If every product has 50 possible combinations, then I think you need to seriously consider canonicalization.

Case 5: Search Pages

Now, the ugliest case of them all – internal search pages. This is a double-edged sword, since Google isn’t a fan of search-within-search (their results landing on your results) in general and these pages tend to spin out of control. Here are some examples:
  1. www.example.com/search.php?topic=1234
  2. www.example.com/search/this-is-a-topic
  3. www.example.com/topic
  4. www.example.com/search.php?topic=1234&page=2
  5. www.example.com/search.php?topic=1234&page=2&sort=desc
  6. www.example.com/search.php?topic=1234&page=2&filter=price
The list, unfortunately, could go on and on. While it’s natural to think that the canonical version should be #1-#3 (depending on your URL structure, just like in Case #4), the trouble is pagination. Pages 2 and beyond of your topic search may appear thin, in some cases, but they return unique results and aren’t technically duplicates. Google’s solutions have changed over time, and their advice can be frustrating, but they currently say to use the rel=prev/next tags. Put simply, these tags tell Google that the pages are part of a series.
In cases like #5-#6, Google recommends you use rel=prev/next for the pagination but then a canonical tag for the “&page=2” version (to collapse the sorts and filters). Implementing this properly is very complicated and well beyond the scope of this post, but the main point is that you should not canonicalize all of your search pages to page 1. Adam Audette has an excellent post on pagination that demonstrates just how tricky this topic is.
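Putting that together for the hypothetical search URLs above, page 2 of a sorted result set would carry both kinds of tags in its `<head>` – a canonical pointing at the unsorted “&page=2” version, plus prev/next links tying the series together:

```html
<!-- In the <head> of search.php?topic=1234&page=2&sort=desc -->

<!-- Collapse the sort/filter variants onto the plain page-2 URL: -->
<link rel="canonical" href="http://www.example.com/search.php?topic=1234&amp;page=2" />

<!-- Tell Google this page is part of a paginated series: -->
<link rel="prev" href="http://www.example.com/search.php?topic=1234" />
<link rel="next" href="http://www.example.com/search.php?topic=1234&amp;page=3" />
```

Note that the canonical target keeps “page=2” – it does not point back at page 1.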

Know Your Crawl Paths

Finally, an important reminder – the most important canonical signal is usually your internal links. If you use the canonical tag to point to one version of a URL, but then every internal link uses a different version, you’re sending a mixed signal and using the tag as a band-aid. The canonical URL should actually be canonical in practice – use it consistently. If you’re an outside SEO coming into a new site, make sure you understand the crawl paths first, before you go and add a bunch of tags. Don’t create a mess on top of a mess.

Tuesday, June 26, 2012

Google Panda Algorithm Update 3.8 on 25 June 2012


Google Panda Update 3.8


Google has announced that a new Panda data refresh was pushed out in the last few days. According to Google’s message on Twitter, the update noticeably affects only ~1% of queries worldwide.
There was speculation over the weekend that an update was rolling out, and the search giant confirmed that it was officially pushed on 25 June. There are no changes to the algorithm itself – this is just a basic data refresh.

The last Panda refresh was pushed out on 8 June, and the one before that on 26 April. In short, Google now refreshes the Panda and Penguin algorithms roughly every month, though this refresh came only about two weeks after the last one. No doubt Google had some new data and wanted to push a fresh update.

In any case, the “fairness warriors” fighting plagiarism will be happy to hear the news. The new Panda refresh is one more push forcing webmasters to create unique, quality content. Many webmasters, though, are afraid it will have a negative impact on their websites. Perhaps some would respect the search giant more if Google applied things like Panda to its own pages. What do you think?

Tuesday, June 19, 2012

5 Tips for Creating SEO-Friendly URLs

Every SEO eventually gets fixated on a tactic. Maybe you read 100 blog posts about how to build the “perfectly” optimized URL, and you keep tweaking and tweaking until you get it just right. Fast-forward 2 months – you’re sitting on 17 layers of 301-redirects, you haven’t done any link-building, you haven’t written any content, you’re eating taco shells with mayonnaise for lunch, and your cat is dead.

Ok, maybe that’s a bit extreme. I do see a lot of questions about the "ideal" URL structure in Q&A, though. Most of them boil down to going from pretty good URLs to slightly more pretty good URLs.

All Change Is Risky

I know it’s not what the motivational speakers want you to hear, but in the real world, change carries risk. Even a perfectly executed site-wide URL change – with pristine 301-redirects – is going to take time for Google to process. During that time, your rankings may bounce. You may get some errors. If your new URL scheme isn’t universally better than the old one, some pages may permanently lose ranking. There’s no good way to A/B test a site-wide SEO change.

More often, it’s just a case of diminishing returns. Going from pretty good to pretty gooder probably isn’t worth the time and effort, let alone the risk. So, when should you change your URLs? I’m going to dive into 5 specific scenarios to help you answer that question…

(1) Dynamic URLs

A dynamic URL creates content from code and data and carries parameters, like this:

www.example.com/product.php?id=12345&color=4&size=3&session=67890

It’s a common SEO misconception that Google can’t read these URLs or gets cut off after 2 or 3 parameters. These days, that’s just not true – although there are reasonable limits on URL length. The real problems with dynamic URLs are usually more complex:

  • They don’t contain relevant keywords.
  • They’re more prone to creating duplicate content.
  • They tend to be less user-friendly (lower click-through).
  • They tend to be longer.

So, when are your URLs too dynamic? The example above definitely needs help. It’s long, it has no relevant keywords, the color and size parameters are likely creating tons of near-duplicates, and the session ID is creating virtually unlimited true duplicates. If you don’t want to be mauled by Panda, it’s time for a change.

In other cases, though, it’s not so simple. What if you have a blog post URL like this?

www.example.com/blog.php?topic=how-to-tame-a-panda

It’s technically a “dynamic” URL, so should you change it to something like:

www.example.com/blog/how-to-tame-a-panda

I doubt you’d see much SEO benefit, or that the rewards would outweigh the risks. In a perfect world, the second URL is better, and if I was starting a blog from scratch I’d choose that one, no question. On an established site with 1000s of pages, though, I’d probably sit tight.

(2) Unstructured URLs

Another common worry people have is that their URLs don’t match their site structure. For example, they have a URL like this one:

www.example.com/diamond-studded-ponies

...and they think they should add folders to represent their site architecture, like:

www.example.com/horses/bejeweled/diamond-studded-ponies

There’s a false belief in play here – people often think that URL structure signals site structure. Just because your URL is 3 levels deep doesn’t mean the crawlers will treat the page as being 3 levels deep. If the first URL is 6 steps from the home-page and the second URL is 1 step away, the second URL is going to get a lot more internal link-juice (all else being equal).

You could argue that the second URL carries more meaning for visitors, but, unfortunately, it’s also longer, and the most unique keywords are pushed to the end. In most cases, I’d lean toward the first version.

Of course, the reverse also applies. Just because a URL structure is “flat” and every page is one level deep, that doesn’t mean that you’ve created a flat site architecture. Google still has to crawl your pages through the paths you’ve built. The flatter URL may have some minor advantages, but it’s not going to change the way that link-juice flows through your site.

Structural URLs can also create duplicate content problems. Let’s say that you allow visitors to reach the same page via 3 different paths:

www.example.com/horses/bejeweled/diamond-studded-ponies

www.example.com/tags/ponies/diamond-studded-ponies

www.example.com/tags/shiny/diamond-studded-ponies

Now, you’ve created 2 pieces of duplicate content – Google is going to see 3 pages that look exactly the same. This is more of a crawl issue than a URL issue, and there are ways to control how these URLs get indexed, but an overly structured URL can exacerbate these problems.

(3) Long URLs

How long of a URL is too long? Technically, a URL should be able to be as long as it needs to be. Some browsers and servers may have limits, but those limits are well beyond anything we’d consider sane by SEO or usability standards. For example, IE8 can support a URL of up to 2,083 characters.

Practically speaking, though, long URLs can run into trouble. Very long URLs:

  • Dilute the ranking power of any given URL keyword
  • May hurt usability and click-through rates
  • May get cut off when people copy-and-paste
  • May get cut off by social media applications
  • Are a lot harder to remember

How long is too long is a bit more art than science. One of the key issues, in my mind, is redundancy. Good URLs are like good copy – if there’s something that adds no meaning, you should probably lose it. For example, here’s a URL with a lot of redundancy:

www.example.com/store/products/featured-products/product-tasty-tasty-waffles

If you have a “/store” subfolder, do you also need a “/products” layer? If we know you’re in the store/products layer, does your category have to be tagged as “featured-products” (why not just “featured”)? Is the “featured” layer necessary at all? Does each product have to also be tagged with “product-“? Are the waffles so tasty you need to say it twice?

In reality, I’ve seen much longer and even more redundant URLs, but that example represents some of the most common problems. Again, you have to consider the trade-offs. Fixing a URL like that one will probably have SEO benefits. Stripping “/blog” out of all your blog post URLs might be a nice-to-have, but it isn’t going to make much practical difference.

(4) Keyword Stuffing

Scenarios (3)-(5) have a bit of overlap. Keyword-stuffed URLs also tend to be long and may cannibalize other pages. Typically, though, a keyword-stuffed URL has either a lot of repetition or tries to tackle every variant of the target phrase. For example:

www.example.com/ponies/diamond-studded-ponies-diamond-ponies-pony

It’s pretty rare to see a penalty based solely on keyword-stuffed URLs, but usually, if your URLs are spammy, it’s a telltale sign that your title tags, <h1>’s, copy, etc. are spammy. Even if Google doesn’t slap you around a little, it’s just a matter of focus. If you target the same phrase 14 different ways, you may get more coverage, but each phrase will also get less attention. Prioritize and focus – not just with URLs, but all keyword targeting. If you throw everything at the wall to see what sticks, you usually just end up with a dirty wall.

(5) Keyword Cannibalization

This is probably the toughest problem to spot, as it happens over an entire site – you can’t spot it in a single URL (and, practically speaking, it’s not just a URL problem). Keyword cannibalization results when you try to target the same keywords with too many URLs.

There’s no one right answer to this problem, as any site with a strong focus is naturally going to have pages and URLs with overlapping keywords. That’s perfectly reasonable. Where you get into trouble is splitting off pages into a lot of sub-pages just to sweep up every long-tail variant. Once you carry that too far, without the unique content to support it, you’re going to start to dilute your index and make your site look “thin”.

The URLs here are almost always just a symptom of a broader disease. Ultimately, if you’ve gotten too ambitious with your scope, you’re going to need to consolidate those pages, not just change a few URLs. This is even more important post-Panda. It used to be that thin content would only impact that content – at worst, it might get ignored. Now, thin content can jeopardize the rankings of your entire site.

Proceed With Caution

If you do decide a sitewide URL change is worth the risk, plan and execute it carefully. How to implement a sitewide URL change is beyond the scope of this post, but keep in mind a couple of high-level points:

  1. Use proper 301-redirects.
  2. Redirect URL-to-URL, for every page you want to keep.
  3. Update all on-page links.
  4. Don’t chain redirects, if you can avoid it.
  5. Add a new XML sitemap.
  6. Leave the old sitemap up temporarily.

Point (3) bears repeating. More than once, I’ve seen someone make a sitewide technical SEO change, implement perfect 301 redirects, but then not update all of their navigation. Your crawl paths are still the most important signal to the spiders – make sure you’re 100% internally consistent with the new URLs.

That last point (6) is a bit counterintuitive, but I know a number of SEOs who insist on it. The problem is simple – if crawlers stop seeing the old URLs, they might not crawl them to process the 301-redirects. Eventually, they’ll discover the new URLs, but it might take longer. By leaving the old sitemap up temporarily, you encourage crawlers to process the redirects. If those 301-redirects are working, this won’t create duplicate content. Usually, you can remove the old sitemap after a few weeks.
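As a sketch of point (2), here is what a URL-to-URL 301 might look like in Apache. The old and new URLs are hypothetical, and note that plain `Redirect` directives can’t match query strings, so mod_rewrite is needed for parameterized old URLs:

```apache
RewriteEngine On

# Old: /store.php?id=1234  ->  New: /store/tasty-tasty-waffles
RewriteCond %{QUERY_STRING} ^id=1234$
# The trailing "?" on the target drops the old query string,
# so the redirect lands on the clean URL in a single hop.
RewriteRule ^store\.php$ http://www.example.com/store/tasty-tasty-waffles? [R=301,L]
```

One rule (or rewrite-map entry) per page you want to keep – never a blanket redirect of everything to the home page.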

Even if the change is done properly and for the right reasons, measure carefully and expect some rankings bounce over the first couple of weeks. Sometimes, Google just needs time to evaluate the new structure.