published: 2008/05/10. tags: duplicate content, SEO, product catalogs
Duplicate Content - Reasons, Detection and Making It Unique
Is your king a clone?
Duplicate content is a big problem for search engines because indexing the same content several
times adds little value. This article discusses the reasons for duplicate content, ways to
identify it, and ways to make duplicate content harder to detect.
Reasons for Duplicate Content
Below are a few of the reasons for duplicate content on the web.
The same content could be provided in different formats. For example, a multi-page article
could have an option to read as a single page (the multi-page format is to get more page visits
which potentially increases ad revenue), or to print without ads, navigation menu and other
irrelevant page elements. In this case, the duplication is not intentional manipulation by the
websites but a way to provide a better service to their users.
Product details. Typically, the product details and technical specifications are provided
by the manufacturers and then all the retailers use that information on their websites (obviously,
retailers can't modify that information just to satisfy the needs of search engines).
RSS and Atom feeds made content syndication and aggregation easy, which resulted in a lot of
websites carrying duplicate content.
As part of search engine optimization, many webmasters have also started taking existing content
on the web and reusing it with a few modifications. I have even seen people posting on freelancing
websites asking others to produce several articles on a specific topic by taking content from the
web and modifying it to make it unique!
Detecting Duplicate Content
The internet has billions of pages, and with Google's recent effort to index the database-driven
pages of popular websites, that number will only get bigger. Detecting duplicates is not easy and
perhaps not even possible with complete accuracy. The challenges of detecting duplicate content are given below.
Combinatorial: If an essay is written by three people, say A, B and C, then to find out whether
anyone copied from another, one has to compare A with B, B with C and A with C. So, if there are
n pages, the number of comparisons is nC2, which is n*(n-1)/2, i.e. O(n^2). However, it is likely
that a well designed algorithm would first remove duplicate web pages within a website and only
then try to identify duplicate content across websites.
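To put a number on it, here is a minimal Python sketch of the pairwise comparison count; the
pages dictionary and is_duplicate callback are just illustrative placeholders, not part of any
real search engine's code.

    from itertools import combinations

    def count_pairwise_comparisons(n):
        # nC2 = n * (n - 1) / 2 comparisons for n pages
        return n * (n - 1) // 2

    def find_duplicate_pairs(pages, is_duplicate):
        # pages: dict mapping url -> text; is_duplicate: a comparison callback.
        # The pairwise loop below is the O(n^2) part discussed above.
        return [(u1, u2)
                for (u1, t1), (u2, t2) in combinations(pages.items(), 2)
                if is_duplicate(t1, t2)]

    print(count_pairwise_comparisons(3))     # 3 comparisons: A-B, B-C, A-C
    print(count_pairwise_comparisons(1000))  # 499500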
Noise: Websites have the main content blended into a web page with the rest of the HTML markup
meant for navigation, advertising and other site-level elements. So, separating the main content
from the rest of the page is non-trivial.
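As a rough illustration, the sketch below uses Python's built-in html.parser to collect visible
text while skipping script, style and nav elements; real main-content extraction (separating ads,
footers and other boilerplate) is far more involved than this.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Collect visible text, naively skipping script/style/nav subtrees.
        SKIP = {"script", "style", "nav"}

        def __init__(self):
            super().__init__()
            self.skip_depth = 0   # nesting depth inside skipped elements
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self.skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self.skip_depth > 0:
                self.skip_depth -= 1

        def handle_data(self, data):
            if self.skip_depth == 0 and data.strip():
                self.chunks.append(data.strip())

    extractor = TextExtractor()
    extractor.feed("<nav><a href='/'>Home</a></nav><p>The actual article text.</p>")
    print(" ".join(extractor.chunks))   # The actual article text.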
Subset/Substring: Even after identifying the main content, two pages may not have 100% matching
content. For example, when an article is broken into multiple pages with an option to view it as
a single page, the single-page content is a superset of the content of each of the individual
pages. Comparing two strings to see if they match is very simple: you just keep comparing the
characters of each string until they differ. However, identifying whether a string is a substring
of another is more involved: you need to start at each index of the larger string and keep
comparing until the smaller string is exhausted. So, the problem becomes O(l^2) instead of just
O(l), where l is the number of characters in the string.
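The naive check below makes that cost concrete; Python's built-in substring search (and
algorithms such as Knuth-Morris-Pratt) do better, but this mirrors the reasoning above.

    def naive_is_substring(needle, haystack):
        # For each start index in haystack, compare character by character,
        # so the worst case is roughly O(l^2) for strings of comparable length l.
        n, h = len(needle), len(haystack)
        for start in range(h - n + 1):
            if haystack[start:start + n] == needle:
                return True
        return False

    single_page = "intro ... part one ... part two ... conclusion"
    page_two = "part two"
    print(naive_is_substring(page_two, single_page))   # True: the page is a subset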
Markup: Adding extra markup to a piece of content does not alter the meaning of the content.
It is likely that webmasters use bold, italics and other markup tags to alter the original content
a bit. So, a well designed algorithm should ignore the HTML tags and compare only the text content.
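A crude tag-stripping sketch, just to show the idea; a real comparison would also decode entities
and handle comments and attributes more carefully.

    import re

    def strip_tags(html):
        # Drop anything that looks like a tag so that marked-up and plain
        # versions of the same sentence compare as equal text.
        return re.sub(r"<[^>]+>", "", html)

    a = "Offers <b>free</b> shipping on <em>all</em> orders."
    b = "Offers free shipping on all orders."
    print(strip_tags(a) == strip_tags(b))   # True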
Normalization: HTML collapses multiple spaces into a single space unless the spaces are represented
using the &nbsp; entity. Similarly, each character can be represented using its numeric entity
code (for example, &#65; for A). So, the text has to be normalized before being compared.
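A small sketch of such normalization using Python's standard html module; the exact rules
(lower-casing, punctuation handling and so on) would vary by implementation.

    import html
    import re

    def normalize(text):
        # 1. decode entities such as &nbsp; and numeric references like &#67;
        # 2. replace non-breaking spaces and collapse runs of whitespace
        text = html.unescape(text).replace("\u00a0", " ")
        text = re.sub(r"\s+", " ", text)
        return text.strip().lower()

    print(normalize("Cheap&nbsp;&nbsp;Phones") == normalize("cheap phones"))   # True
    print(normalize("&#67;heap Phones") == normalize("Cheap Phones"))          # True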
Making Duplicate Content Unique
Altering duplicate content so that it is harder to detect essentially means making the detection
steps described above harder. Also, some implementations may not take all of the above factors
into account, so working against even a few of those detection steps could make the content harder to flag.
Augmenting the duplicate content with extra content: Most retailers provide a way for their
customers to comment on products. While these comments help other potential customers decide
whether to buy the product, the retailers also benefit because the comments make the otherwise
identical product details and specifications more unique.
Altering the markup: Given that search engines place importance on bold, strong, em and the
h1-style heading tags, altering the markup by carefully wrapping the appropriate keywords in
these keyword-weight-altering tags helps visitors as well.
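Something along the lines of the sketch below, where the keyword list is hand-picked per product;
the function name and the keyword choices are purely illustrative.

    import re

    def emphasize_keywords(description, keywords):
        # Wrap hand-picked salient keywords in <strong> so the same
        # manufacturer text ends up marked up differently on this site.
        for word in keywords:
            description = re.sub(r"\b(%s)\b" % re.escape(word),
                                 r"<strong>\1</strong>",
                                 description, count=1, flags=re.IGNORECASE)
        return description

    spec = "12 megapixel camera with optical zoom and image stabilization."
    print(emphasize_keywords(spec, ["optical zoom", "image stabilization"]))
    # 12 megapixel camera with <strong>optical zoom</strong> and <strong>image stabilization</strong>.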
Replacing single spaces with multiple spaces, and similarly using the numeric entity
representation for some characters.
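For illustration only, a toy version of that trick; the replacement rate and the random seed are
arbitrary choices of mine, not anything a particular tool uses.

    import random

    def obfuscate_text(text, entity_rate=0.1, seed=42):
        # Occasionally replace a space with '&nbsp; ' and a letter with its
        # numeric character reference, so the markup differs from the original
        # even though the visible text barely changes.
        rng = random.Random(seed)
        out = []
        for ch in text:
            if ch == " " and rng.random() < entity_rate:
                out.append("&nbsp; ")
            elif ch.isalpha() and rng.random() < entity_rate:
                out.append("&#%d;" % ord(ch))
            else:
                out.append(ch)
        return "".join(out)

    print(obfuscate_text("Lightweight laptop with long battery life"))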
It may also be possible to do some natural language processing (NLP) and, with the assistance of
a thesaurus, alter the sentences while retaining their meaning.
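A toy sketch of what that might look like with a tiny hand-made synonym map; real synonym
substitution needs part-of-speech and word-sense handling to actually retain the meaning, which
is well beyond this.

    # Tiny hand-made synonym map; purely illustrative, not a real thesaurus.
    SYNONYMS = {"purchase": "buy", "assist": "help", "large": "big"}

    def rewrite(sentence):
        words = sentence.split()
        return " ".join(SYNONYMS.get(w.lower(), w) for w in words)

    print(rewrite("Customers purchase the large model to assist with travel"))
    # Customers buy the big model to help with travel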
The intention of describing these techniques to thwart duplicate content detection is not to
encourage such behavior but to help people realize what sophisticated webmasters are doing. I have
personally used only the markup-altering technique for product descriptions, mainly to bring the
salient features of a product to the user's attention.