Duplicate content

Duplicate content is a term used in search engine optimization to describe content that appears on more than one web page. It can consist of substantial blocks of content within or across domains, and can be either an exact duplicate or closely similar.^[1] When multiple pages contain essentially the same content, search engines such as Google and Bing can penalize the copying site or stop displaying it in relevant search results.

Types

Non-malicious

Non-malicious duplicate content may include variations of the same page, such as versions optimized for normal HTML, mobile devices, or printer-friendliness, or store items that can be shown via multiple distinct URLs.^[1] Duplicate content issues can also arise when a site is accessible under multiple subdomains, for example with and without the "www." prefix, or when a site fails to handle the trailing slash of URLs consistently.^[2] Another common source of non-malicious duplicate content is pagination, in which content or its corresponding comments are divided into separate pages.^[3]
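The "www." and trailing-slash variants can be illustrated with a minimal Python sketch that maps such URLs onto a single form. The function name `normalize_url` and the policy it applies (drop "www.", drop the trailing slash) are assumptions chosen for illustration; a real site would pick whichever form it treats as canonical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Collapse 'www.' and trailing-slash variants onto one form.

    Illustrative policy only: lowercase the scheme and host, drop a
    leading 'www.', and drop a trailing slash from the path (keeping
    a bare '/' for the site root).
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://www.example.com/shop/",
    "http://example.com/shop",
    "http://EXAMPLE.com/shop/",
]
# All three variants map to the same normalized URL.
unique_pages = {normalize_url(u) for u in variants}
```

Here `unique_pages` ends up with a single entry, which is the sense in which the three URLs are duplicates of one page.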

Syndicated content is a popular form of duplicated content. If a site syndicates content from other sites, it is generally considered important to make sure that search engines can tell which version of the content is the original, so that the original gets the benefit of greater exposure in search engine results.^[1] Ways of doing this include placing a rel=canonical tag on the syndicated page that points back to the original, applying a noindex directive to the syndicated copy, or putting a link in the syndicated copy that leads back to the original article. If none of these solutions is implemented, the syndicated copy could be treated as the original and gain the benefits.^[4]
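As a rough illustration of the first two signals, the following Python sketch inspects a syndicated page for a rel=canonical link and a robots noindex directive. The class and function names, and the sample markup with its `original.example` URL, are hypothetical; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records a rel=canonical link and a robots 'noindex' meta tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel_values = (a.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.canonical = a.get("href")
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noindex" in (a.get("content") or "").lower():
                self.noindex = True


def inspect_syndicated_page(html: str):
    """Return (canonical_url_or_None, has_noindex) for an HTML document."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical, finder.noindex


# Hypothetical syndicated copy that points back to the original article.
sample = """
<html><head>
<link rel="canonical" href="https://original.example/article">
<meta name="robots" content="noindex, follow">
</head><body>Syndicated text</body></html>
"""
canonical_url, has_noindex = inspect_syndicated_page(sample)
```

A page that carries neither signal returns `(None, False)`, which corresponds to the case where the syndicated copy could be mistaken for the original.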

The number of possible URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there are four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed through 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This combinatorial growth creates a problem for crawlers, as they must sort through many combinations of relatively minor scripted changes in order to retrieve unique content.
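The count of 48 follows from multiplying the number of choices per parameter: four sort orders, three thumbnail sizes, two file formats, and two settings for user-provided content. A short Python sketch can enumerate the resulting URLs; the parameter names, their values, and the `gallery.example` address are invented for illustration.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical parameter values for the photo gallery example.
SORT_ORDERS = ["name", "date", "size", "rating"]   # four ways to sort
THUMB_SIZES = ["small", "medium", "large"]         # three thumbnail sizes
FILE_FORMATS = ["jpg", "png"]                      # two file formats
USER_CONTENT = ["on", "off"]                       # user-provided content shown or hidden


def gallery_urls(base: str = "http://gallery.example/photos") -> list:
    """Build every URL variant that displays the same set of photos."""
    urls = []
    for sort, thumb, fmt, user in product(
        SORT_ORDERS, THUMB_SIZES, FILE_FORMATS, USER_CONTENT
    ):
        query = urlencode(
            {"sort": sort, "thumb": thumb, "fmt": fmt, "user": user}
        )
        urls.append(base + "?" + query)
    return urls


all_urls = gallery_urls()
```

Each additional parameter multiplies the total, so a crawler that treats every distinct URL as a distinct page would fetch the same gallery 48 times.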

Different web pages may also contain similar content in the form of similar product descriptions. This is usually seen on e-commerce websites, where the use of similar keywords for similar categories of products leads to this form of non-malicious duplicate content. It is often the case when new iterations and versions of products are released but the seller or the e-commerce website's moderators do not update the whole product descriptions.^[5]

Malicious

Malicious duplicate content refers to content that is intentionally duplicated in an effort to manipulate search results and gain more traffic. This is known as search spam. There are a number of tools available to verify the uniqueness of content.^[6] In certain cases, search engines penalize the rankings of websites and individual offending pages in the search engine results pages (SERPs) for duplicate content considered “spammy.”

Detecting duplicate content

Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.^[7]^[8]

Detection of plagiarism can be undertaken in a variety of ways. Human detection is the most traditional form of identifying plagiarism in written work. This can be a lengthy and time-consuming task for the reader^[8] and can also result in inconsistencies in how plagiarism is identified within an organization.^[9] Text-matching software (TMS), which is also referred to as "plagiarism detection software" or "anti-plagiarism" software, has become widely available in the form of both commercial products and open-source software. TMS does not actually detect plagiarism per se, but instead finds specific passages of text in one document that match text in another document.
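One simple way to find matching passages can be sketched in Python using word shingles (overlapping runs of k words) and Jaccard similarity. This is an illustrative technique chosen for brevity, not a description of how any particular TMS product works; the function names and the choice of k = 3 are assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    if not union:
        return 0.0  # both texts are empty, so nothing can be compared
    return len(sa & sb) / len(union)


original = "The quick brown fox jumps over the lazy dog"
reworded = "The quick brown fox leaps over the lazy dog"
score = jaccard_similarity(original, reworded)
```

A score near 1.0 indicates that the two texts share most of their word sequences, while a score near 0.0 indicates little overlap. Whether a given score amounts to plagiarism remains a human judgement, as noted above.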

Resolutions

If the content has been copied, there are multiple resolutions available to both parties.^[10]

Get the content removed from the copier's site by contacting the site's owner and requesting that they remove the copied content.
Hire an attorney to send a takedown notice to the copier.
Rewrite the content to make the site's content unique again.

An HTTP 301 redirect (301 Moved Permanently) is a method of dealing with duplicate content that redirects users and search engine crawlers to the single pertinent version of the content.^[1]
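A 301 redirect can be sketched with Python's standard-library `http.server` module. In this minimal example, requests that arrive under any host other than the chosen canonical one are answered with 301 Moved Permanently and a `Location` header. The host name `example.com`, the helper `redirect_target`, and the handler class are assumptions made for illustration; production sites normally configure such redirects in the web server itself.

```python
from http.server import BaseHTTPRequestHandler

CANONICAL_HOST = "example.com"  # hypothetical canonical host name


def redirect_target(host_header, path):
    """Return the canonical URL to redirect to, or None if no redirect is needed."""
    # Drop any ':port' suffix and compare case-insensitively.
    host = (host_header or "").split(":")[0].lower()
    if host == CANONICAL_HOST:
        return None
    return "http://" + CANONICAL_HOST + path


class CanonicalRedirectHandler(BaseHTTPRequestHandler):
    """Redirects non-canonical hosts; serves a placeholder page otherwise."""

    def do_GET(self):
        target = redirect_target(self.headers.get("Host"), self.path)
        if target is not None:
            self.send_response(301)  # 301 Moved Permanently
            self.send_header("Location", target)
            self.end_headers()
            return
        body = b"canonical page\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to `http.server.HTTPServer`, for example `HTTPServer(("127.0.0.1", 8000), CanonicalRedirectHandler).serve_forever()`. A request with the header `Host: www.example.com` for `/shop` would then be redirected to `http://example.com/shop`.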

References

[1] "Duplicate content". Google Inc. Retrieved 2016-01-07.
[2] "Duplicate content - Duplicate Content". Retrieved 2011-12-19.
[3] "Duplicate Content: Causation and Significance". Effective Business Growth. Retrieved 15 May 2017.
[4] Enge, Eric (April 28, 2014). "Syndicated Content: Why, When & How". Search Engine Land. Third Door Media. Retrieved June 25, 2018.
[5] "Avoid Penalized By Google On Duplicate Content".
[6] Ahmad, Bilal (20 May 2011). "6 Free Duplicate Content Checker Tools". TechMaish.com. Retrieved 15 May 2017.
[7] Culwin, Fintan; Lancaster, Thomas (2001). "Plagiarism, prevention, deterrence and detection". CiteSeerX 10.1.1.107.178. Archived from the original on 18 April 2021. Retrieved 2022-11-11 – via The Higher Education Academy.
[8] Bretag, T., & Mahmud, S. (2009). A model for determining student plagiarism: Electronic detection and academic judgement. Journal of University Teaching & Learning Practice, 6(1). Retrieved from http://ro.uow.edu.au/jutlp/vol6/iss1/6
[9] Macdonald, R., & Carroll, J. (2006). Plagiarism—a complex issue requiring a holistic institutional approach. Assessment & Evaluation in Higher Education, 31(2), 233–245. doi:10.1080/02602930500262536
[10] "Have Duplicate Content? It Can Kill Your Rankings". OrangeFox.com. OrangeFox. Retrieved 27 March 2016.
