Google has recently been working hard on developing tools and new methods to combat duplicate content. 301s remain the best fix, and prevention is best done with robots.txt blocks and nofollow, but the new tools are great if you can't get access to redirect or block.
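To make the 301 side of that advice concrete, here is a minimal sketch of what a redirect rule could look like in an Apache .htaccess file, assuming a site where the same pages resolve on both a www and a non-www hostname. The domain and server setup are made up for the example; the robots.txt side of the advice is just a plain Disallow line for anything you never want crawled.

# Hypothetical sketch: force every request onto the www hostname with a
# permanent (301) redirect so only one version of each URL can be indexed.
# Assumes Apache with mod_rewrite enabled; example.com is a made-up domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]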


Duplicate content splits PageRank and can cause some pages to be filtered from the rankings, yet big websites don't seem to care about the issue. If the big sites don't care, why should a small business spend time on lengthy changes and redirects for an issue some sites don't even bat an eyelid at?

Let’s have a look at some examples.

BBC

On the whole the SEO on the BBC site is good, but they do have a duplicate content issue. I first pointed this out in a post back in February. The problem was two URLs for each page.

http://news.bbc.co.uk/sport1/hi/football/teams/a/arsenal/7831046.stm

There is also a second URL; the only difference is that it sits in the folder sport2 rather than sport1:

http://news.bbc.co.uk/sport2/hi/football/teams/a/arsenal/7831046.stm

On top of that there is also the low-graphics version of the page:

http://news.bbc.co.uk/sport1/low/football/teams/b/blackpool/7831046.stm

And under the sport2 folder

http://news.bbc.co.uk/sport2/low/football/teams/b/blackpool/7831046.stm

Facebook

In another post I did a while ago I showed that your profile can be loaded on two URLs:

http://www.facebook.com/johnpcampbell

and

http://en-gb.facebook.com/johnpcampbell

Some profiles are also now appearing with ?_fb_noscript=1 appended to the URL. The example above isn't indexed, but these two are: http://www.facebook.com/wgardner69 and http://en-gb.facebook.com/wgardner69?_fb_noscript=1 (some random person!).

LinkedIn

A work colleague of mine, Neil Walker (follow him on Twitter: @theukseo), spotted that LinkedIn has a duplication problem, with two URLs for his profile:

http://www.linkedin.com/pub/neil-walker/4/41a/793

http://www.linkedin.com/in/internetmarketingoptimisation

Travel Supermarket & Virgin Media

Another spot from Neil was a very strange duplication on Travel Supermarket & Virgin Media. This time it looked like the content was duplicated on a subdomain rather than there being two URLs for one page of content.

Twitter

I can't remember who spotted this (please comment and I'll link to you), but Twitter has an HTTPS duplication problem and a mobile subdomain duplicating content:

m.twitter.com/johnpcampbell

twitter.com/johnpcampbell

https://twitter.com/johnpcampbell

Looking today, explore.twitter.com/johnpcampbell is also indexed, but they have a fix in place in the form of a 301 redirect to twitter.com/johnpcampbell.

Should you still care about duplicate content?

In all these examples, due to the size and power of the sites, the duplication isn't really having an adverse effect on their overall performance (like throwing a dart at Godzilla: he's not going to feel a thing). Google seems to be able to work out which is the correct URL to display. It would be nice to know the effect of correcting this, though, as these sites have so many pages.

Just a little fix to stop duplicate content on Twitter would cut Google's crawling time, allowing the search engine to spider more pages. Unfortunately we'll never know, but I'll keep on fixing site-wide duplicate content issues.

Do you think big companies need to sort out duplicate content issues? Add a comment.

Something I spend a lot of time doing at work is finding content management and e-commerce platforms that create duplicate content. It's generally generated by print views, PDFs, differences in URL generation or clients ripping off other sites.

It can lead to big problems if it affects your entire site. Getting these problems sorted out can often increase long-tail traffic by 10 to 20 percent. I've noticed in the past that the BBC site does spit out duplicate pages. For example:

Andrei Arshavin signed for Arsenal today, and the BBC have a nice article with a video at the top. The URL for that page is:

http://news.bbc.co.uk/sport1/hi/football/teams/a/arsenal/7831046.stm

There is also a second URL; the only difference is that it sits in the folder sport2 rather than sport1:

http://news.bbc.co.uk/sport2/hi/football/teams/a/arsenal/7831046.stm

That URL 302-redirects to the first one, but Google still caches the second URL. This can be seen at http://tinyurl.com/bdsj3q.

That's not all: soon the low-graphics version will get cached too.

http://news.bbc.co.uk/sport1/low/football/teams/b/blackpool/7831046.stm

And under the sport2 folder

http://news.bbc.co.uk/sport2/low/football/teams/b/blackpool/7831046.stm

So it's duplicate content, and Google says you should try to make sure there is only one version of each page on your site. The reasons why you should sort this out are pretty simple:

- It splits the flow of link juice
- It splits inbound links across the duplicate versions
- Search engines spend time re-caching pages they have already seen rather than picking up your new pages

So should the BBC block Google from caching the duplicates? Well, yes, but for a site of that size, with the speed at which Google caches the content and the inbound links it generates, it's not going to cause a problem.

If you see similar problems on newer sites, though, you do need to get fixes in place: 301 the pages that have already been cached and then block the search engines!
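As a rough illustration of that kind of fix, here is what a pattern-level 301 might look like on an Apache server for the sport2/sport1 style of duplication above (I don't know what the BBC actually runs, so the server and the rule are assumptions for the example):

# Hypothetical sketch: permanently redirect any /sport2/... URL to its
# /sport1/... equivalent so the duplicate pages consolidate onto one version.
# Assumes Apache with mod_alias available (RedirectMatch).
RedirectMatch 301 ^/sport2/(.*)$ /sport1/$1

RedirectMatch keeps it to a single line; an equivalent mod_rewrite rule would do the same job.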
