Something I spend a lot of time doing at work is finding content management and e-commerce platforms that create duplicate content. It's generally generated by print views, PDFs, differences in URL generation, or clients ripping off other sites.
It can lead to big problems if it affects your entire site. Getting these problems sorted out can often increase long-tail traffic by 10 to 20 percent. I've noticed in the past that the BBC site does spit out duplicate pages. For example:
Andrei Arshavin signed today for Arsenal, and the BBC have a nice article with a video at the top. The URL for that page is:
You also have a second URL; the only difference is that it's in the folder sport2 rather than sport1:
That URL 302-redirects to the first one, but Google still caches the second URL. This can be seen at http://tinyurl.com/bdsj3q.
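You can see what a search engine sees here by making the request without following the redirect. Here's a minimal, self-contained sketch of that 302 behaviour using a throwaway local server; the sport1/sport2 paths are hypothetical stand-ins for the BBC's real URLs, not their actual scheme:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CANONICAL = "/sport1/hi/football/12345.stm"  # hypothetical canonical path
DUPLICATE = "/sport2/hi/football/12345.stm"  # hypothetical duplicate path

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == DUPLICATE:
            # A 302 is only "temporary", so search engines may keep the
            # duplicate URL in their index instead of consolidating it.
            self.send_response(302)
            self.send_header("Location", CANONICAL)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"article")

    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does not follow redirects, so we see the raw status code,
# much like a crawler inspecting the response
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", DUPLICATE)
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 302 /sport1/hi/football/12345.stm
server.shutdown()
```

The point of the sketch is that a 302, unlike a 301, doesn't tell the search engine the duplicate URL is gone for good, so both copies can stay cached.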
That's not all: soon the low-graphics version will get cached too.
And under the sport2 folder:
So it's duplicate content. Google says you should make sure you only have one version of each page on your site. The reasons to sort this out are pretty simple:
- It splits the flow of link juice across multiple URLs
- It splits the inbound links the page could attract
- Search engines spend time caching pages they've already seen rather than picking up your new pages
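One way to spot this kind of duplication in your own crawl data is to normalise each URL and group by the result. This is a sketch under assumed rules; the mirrored folder names and the presentation-only query parameter are hypothetical examples, not the BBC's actual URL scheme:

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

MIRROR_FOLDERS = {"/sport2/": "/sport1/"}  # assumed mirrored sections
STRIP_PARAMS = {"lowgraphics"}             # assumed presentation-only param

def normalise(url):
    """Map a URL to its canonical form under the rules above."""
    parts = urlsplit(url)
    path = parts.path
    for dup, canonical in MIRROR_FOLDERS.items():
        if path.startswith(dup):
            path = canonical + path[len(dup):]
    # Drop query parameters that only change presentation, keep the rest
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in STRIP_PARAMS])
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))

crawl = [
    "http://news.bbc.co.uk/sport1/hi/football/12345.stm",
    "http://news.bbc.co.uk/sport2/hi/football/12345.stm",
    "http://news.bbc.co.uk/sport1/hi/football/12345.stm?lowgraphics=1",
]

groups = defaultdict(list)
for url in crawl:
    groups[normalise(url)].append(url)

for key, urls in groups.items():
    if len(urls) > 1:
        print(f"{len(urls)} URLs collapse to {key}")
```

All three example URLs collapse to one canonical address, which is exactly the situation that splits link juice and inbound links in the list above.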
So should the BBC block Google from caching the duplicates? Well, yes, but for a site of that size, with the speed Google crawls its content and the inbound links it generates, it's not going to cause a problem.
If you see similar problems on newer sites, though, you do need to get fixes in place: 301 the pages that have already been cached, and then block the search engines!
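The order matters: put the 301s in place first, because once a URL is blocked in robots.txt the crawler can no longer fetch it to discover the redirect. You can sanity-check the blocking rule with Python's standard robots.txt parser; the Disallow path and URLs here are hypothetical examples:

```python
import urllib.robotparser

# A robots.txt rule blocking the assumed duplicate folder
robots_txt = """\
User-agent: *
Disallow: /sport2/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The duplicate folder is blocked, the canonical one is still crawlable
print(rp.can_fetch("Googlebot", "http://news.bbc.co.uk/sport2/hi/football/12345.stm"))  # False
print(rp.can_fetch("Googlebot", "http://news.bbc.co.uk/sport1/hi/football/12345.stm"))  # True
```

A quick check like this before deploying saves you from the worst case: accidentally blocking the canonical section instead of the duplicate one.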