published: 2008/05/10. tags: duplicate content, SEO, product catalogs
Duplicate Content - Reasons, Detection and Making It Unique
Is your king a clone?
Duplicate content is a big problem for search engines because indexing the same content several
times adds little value. This article discusses the reasons for duplicate content, ways to
identify it, and ways to make duplicate content harder to detect.
Reasons for Duplicate Content
Below are a few of the reasons for duplicate content on the web.
The same content could be provided in different formats. For example, a multi-page article
could have an option to read as a single page (the multi-page format is to get more page visits
which potentially increases ad revenue), or to print without ads, navigation menu and other
irrelevant page elements. In this case, the duplication is not intentional manipulation by the
websites but a way to provide a better service to their users.
Product details. Typically, the product details and technical specifications are provided
by the manufacturers and then all the retailers use that information on their websites (obviously,
retailers can't modify that information just to satisfy the needs of search engines).
RSS and Atom feeds made content syndication and aggregation easy, which resulted in a lot of
websites carrying duplicate content.
As part of search engine optimization, many webmasters have also started taking existing content
on the web and reusing it with a few modifications. I have even seen people posting on freelancing
websites asking others to produce several articles on a specific topic by taking content from the
web and modifying it to make it unique!
Detecting Duplicate Content
The internet has billions of pages, and with Google's recent effort to index the database-driven
pages of popular websites, that number will only get bigger. Detecting duplicates is not easy and
perhaps not even possible with complete accuracy. The challenges of detecting duplicate content are given below.
Combinatorial: If an essay is written by three people, say A, B and C, then to find out whether
anyone copied from another, one has to compare A with B, B with C and A with C. So, if there are
n pages, the number of comparisons is nC2, which is n*(n-1)/2, i.e. O(n^2). However, it is likely
that a well designed algorithm would first remove duplicate web pages within a website and only
then try to identify duplicate content across websites.
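To put a number on it, here is a minimal Python sketch of the pairwise comparison count; the
pages dictionary and is_duplicate callback are just illustrative placeholders, not part of any
real search engine's code.

    from itertools import combinations

    def count_pairwise_comparisons(n):
        # nC2 = n * (n - 1) / 2 comparisons for n pages
        return n * (n - 1) // 2

    def find_duplicate_pairs(pages, is_duplicate):
        # pages: dict mapping url -> text; is_duplicate: a comparison callback.
        # The pairwise loop below is the O(n^2) part discussed above.
        return [(u1, u2)
                for (u1, t1), (u2, t2) in combinations(pages.items(), 2)
                if is_duplicate(t1, t2)]

    print(count_pairwise_comparisons(3))     # 3 comparisons: A-B, B-C, A-C
    print(count_pairwise_comparisons(1000))  # 499500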
Noise: Websites have the main content blended into a web page with the rest of the HTML markup
meant for navigation, advertising and other site-level elements. So, separating the main content
from the rest of the page is non-trivial.
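As a rough illustration, the sketch below uses Python's built-in html.parser to collect visible
text while skipping script, style and nav elements; real main-content extraction (separating ads,
footers and other boilerplate) is far more involved than this.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Collect visible text, naively skipping script/style/nav subtrees.
        SKIP = {"script", "style", "nav"}

        def __init__(self):
            super().__init__()
            self.skip_depth = 0   # nesting depth inside skipped elements
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self.skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self.skip_depth > 0:
                self.skip_depth -= 1

        def handle_data(self, data):
            if self.skip_depth == 0 and data.strip():
                self.chunks.append(data.strip())

    extractor = TextExtractor()
    extractor.feed("<nav><a href='/'>Home</a></nav><p>The actual article text.</p>")
    print(" ".join(extractor.chunks))   # The actual article text.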
Subset/Substring: Even after identifying the main content, two pages may not have 100% matching
content. For example, when an article is broken into multiple pages with an option to view it as
a single page, the single-page content is a superset of the content of each of the individual
pages. Comparing two strings to see if they match is very simple: you just keep comparing the
characters of each string until they differ. However, identifying whether a string is a substring
of another is more involved: you need to start at each index of the larger string and keep
comparing until the smaller string is exhausted. So, the problem becomes O(l^2) instead of just
O(l), where l is the number of characters in the string.
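The naive check below makes that cost concrete; Python's built-in substring search (and
algorithms such as Knuth-Morris-Pratt) do better, but this mirrors the reasoning above.

    def naive_is_substring(needle, haystack):
        # For each start index in haystack, compare character by character,
        # so the worst case is roughly O(l^2) for strings of comparable length l.
        n, h = len(needle), len(haystack)
        for start in range(h - n + 1):
            if haystack[start:start + n] == needle:
                return True
        return False

    single_page = "intro ... part one ... part two ... conclusion"
    page_two = "part two"
    print(naive_is_substring(page_two, single_page))   # True: the page is a subset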
Markup: Adding extra markup to a piece of content does not alter the meaning of the content.
It is likely that webmasters use bold, italics and other markup tags to alter the original content
a bit. So, a well designed algorithm should ignore the HTML tags and compare only the text content.
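A crude tag-stripping sketch, just to show the idea; a real comparison would also decode entities
and handle comments and attributes more carefully.

    import re

    def strip_tags(html):
        # Drop anything that looks like a tag so that marked-up and plain
        # versions of the same sentence compare as equal text.
        return re.sub(r"<[^>]+>", "", html)

    a = "Offers <b>free</b> shipping on <em>all</em> orders."
    b = "Offers free shipping on all orders."
    print(strip_tags(a) == strip_tags(b))   # True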
Normalization: HTML collapses multiple spaces into a single space unless the spaces are represented
using the &nbsp; entity. Similarly, each character can be represented using its numeric entity
code (for example, &#65; for A). So, the text has to be normalized before being compared.
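A small sketch of such normalization using Python's standard html module; the exact rules
(lower-casing, punctuation handling and so on) would vary by implementation.

    import html
    import re

    def normalize(text):
        # 1. decode entities such as &nbsp; and numeric references like &#67;
        # 2. replace non-breaking spaces and collapse runs of whitespace
        text = html.unescape(text).replace("\u00a0", " ")
        text = re.sub(r"\s+", " ", text)
        return text.strip().lower()

    print(normalize("Cheap&nbsp;&nbsp;Phones") == normalize("cheap phones"))   # True
    print(normalize("&#67;heap Phones") == normalize("Cheap Phones"))          # True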
Making Duplicate Content Unique
Altering duplicate content so that it is harder to detect essentially means making the detection
steps described above harder. Also, some implementations may not take all of the above factors
into account, so working against even a few of those detection steps could make the content harder to flag.
Augmenting the duplicate content with extra content: Most retailers provide a way for their
customers to comment on products. While these comments help other potential customers decide
whether to buy the product, the retailers also benefit because the comments make the otherwise
identical product details and specifications more unique.
Altering the markup: Given that search engines place importance on bold, strong, em and the
h1-style heading tags, altering the markup by carefully wrapping the appropriate keywords in
these keyword-weight-altering tags helps visitors as well.
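Something along the lines of the sketch below, where the keyword list is hand-picked per product;
the function name and the keyword choices are purely illustrative.

    import re

    def emphasize_keywords(description, keywords):
        # Wrap hand-picked salient keywords in <strong> so the same
        # manufacturer text ends up marked up differently on this site.
        for word in keywords:
            description = re.sub(r"\b(%s)\b" % re.escape(word),
                                 r"<strong>\1</strong>",
                                 description, count=1, flags=re.IGNORECASE)
        return description

    spec = "12 megapixel camera with optical zoom and image stabilization."
    print(emphasize_keywords(spec, ["optical zoom", "image stabilization"]))
    # 12 megapixel camera with <strong>optical zoom</strong> and <strong>image stabilization</strong>.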
Replacing single spaces with multiple spaces, and similarly using the numeric entity
representation for some characters.
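For illustration only, a toy version of that trick; the replacement rate and the random seed are
arbitrary choices of mine, not anything a particular tool uses.

    import random

    def obfuscate_text(text, entity_rate=0.1, seed=42):
        # Occasionally replace a space with '&nbsp; ' and a letter with its
        # numeric character reference, so the markup differs from the original
        # even though the visible text barely changes.
        rng = random.Random(seed)
        out = []
        for ch in text:
            if ch == " " and rng.random() < entity_rate:
                out.append("&nbsp; ")
            elif ch.isalpha() and rng.random() < entity_rate:
                out.append("&#%d;" % ord(ch))
            else:
                out.append(ch)
        return "".join(out)

    print(obfuscate_text("Lightweight laptop with long battery life"))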
It may also be possible to do some natural language processing (NLP) and, with the assistance of
a thesaurus, alter the sentences while retaining their meaning.
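A toy sketch of what that might look like with a tiny hand-made synonym map; real synonym
substitution needs part-of-speech and word-sense handling to actually retain the meaning, which
is well beyond this.

    # Tiny hand-made synonym map; purely illustrative, not a real thesaurus.
    SYNONYMS = {"purchase": "buy", "assist": "help", "large": "big"}

    def rewrite(sentence):
        words = sentence.split()
        return " ".join(SYNONYMS.get(w.lower(), w) for w in words)

    print(rewrite("Customers purchase the large model to assist with travel"))
    # Customers buy the big model to help with travel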
The intention of describing these techniques to thwart duplicate content detection is not to
encourage such behavior but to help people realize what sophisticated webmasters are doing. I have
personally used only the markup-altering technique for product descriptions, mainly to bring the
salient features of a product to the user's attention.