Scalable Data Deduplication Using Similarity Matching And In Memory Indexes

This article describes an approach that mitigates this problem and provides scalable, high-throughput deduplication. It combines data segmentation and chunking with MinHash-based similarity matching to locate zones (or segments) of similar data quickly, then performs chunk-hash-based identity deduplication within those zones.
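As a rough illustration of the idea, and not the publication's actual implementation, the sketch below chunks each segment, computes a MinHash signature over its chunk hashes, uses an in-memory index to find the most similar stored segment, and then compares chunk hashes only against that segment. Everything specific here is an assumption made for the example: fixed-size chunking, SHA-256 chunk hashes, 16 salted hash functions, the voting-based lookup, and the names `SimilarityIndex`, `chunk_hashes`, and `minhash`.

```python
import hashlib
from collections import Counter

CHUNK_SIZE = 4     # bytes per chunk; unrealistically small, for illustration only
NUM_HASHES = 16    # number of MinHash components (assumed value)


def chunk_hashes(segment: bytes) -> list[bytes]:
    """Split a segment into fixed-size chunks and hash each chunk."""
    return [hashlib.sha256(segment[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(segment), CHUNK_SIZE)]


def minhash(hashes: list[bytes]) -> tuple[int, ...]:
    """MinHash signature of a non-empty collection of chunk hashes."""
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(4, "big")  # one salted hash function per seed
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + h).digest()[:8], "big")
            for h in hashes))
    return tuple(signature)


class SimilarityIndex:
    """In-memory index from signature components to stored segments."""

    def __init__(self):
        self.by_component = {}    # (position, min value) -> list of segment ids
        self.segment_chunks = {}  # segment id -> set of chunk hashes

    def most_similar(self, signature):
        """Return the stored segment sharing the most signature components."""
        votes = Counter()
        for pos, value in enumerate(signature):
            for seg_id in self.by_component.get((pos, value), ()):
                votes[seg_id] += 1
        return votes.most_common(1)[0][0] if votes else None

    def add(self, seg_id, segment: bytes):
        """Store a segment; return (matched segment id, count of new chunks)."""
        hashes = chunk_hashes(segment)
        signature = minhash(hashes)
        match = self.most_similar(signature)
        # Identity deduplication is restricted to the matched zone.
        known = self.segment_chunks.get(match, set())
        new_chunks = [h for h in dict.fromkeys(hashes) if h not in known]
        for pos, value in enumerate(signature):
            self.by_component.setdefault((pos, value), []).append(seg_id)
        self.segment_chunks[seg_id] = set(hashes)
        return match, len(new_chunks)


index = SimilarityIndex()
first = index.add("seg-1", b"AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH")
second = index.add("seg-2", b"AAAABBBBCCCCDDDDEEEEFFFFGGGGZZZZ")
print(first, second)
```

In this sketch the second segment shares seven of its eight chunks with the first, so the similarity lookup should point it at `seg-1` and only the differing chunk would need to be stored. Because chunk hashes are compared only within the matched segment, the index holds one small signature per segment rather than an entry for every chunk.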

Publication Date
16 April 2013

data de-duplication


