OpenLSH is an open source platform that implements Locality Sensitive Hashing.
To familiarize yourself with LSH, please visit the demo site.
To try it, you will need a Gmail-based account and a Twitter account.
That is the main idea of LSH: the incoming "Documents" — tweets in this case — are assigned to "buckets". This can be done as documents come in from the source, as implemented in the demo. The resulting buckets can be examined and identical or similar documents just fall into the same buckets. Each document will fall into multiple buckets. We examine buckets that have multiple documents in them and report on the ones that have a high degree of Jaccard similarity of their 4-gram character shingles.
Central to OpenLSH is a sparse matrix data structure
It has as many rows as there are documents and as many columns as there are buckets (up to 232 buckets in this implementation).
A related data structure
MatrixRow represents a document.
A line format class specifies how to parse incoming documents.
The line format class should specify a static method
parse which returns a document ID and the text to be used for classification.
PeerbeltLine in the code base as examples.
You may follow the implementation of PeerbeltLine as a template for parsing documents in your implementation —
the original text is HTML and we strip out content inside
<style> tags and also remove other HTML tags before assigning documents to buckets.
In this case, documents that have the same text but differ only in their markup are considered to be identical.
A database specifies how to interact with the database.
settings.py specifies which database is being used.
Implementations are available for Google App Engine Datastore and for Cassandra.
An in-memory database based on Python data structures is also available and it is handy for testing.
To start, replicate the code in your own App Engine instance.
git clone https://github.com/singhj/locality-sensitive-hashing.git).
twitter_settings.pyas appropriate. Take care to specify a Callback URL correctly and Ours is
http://open-lsh.datathinks.org/twitter_callback, yours should be customized to your setup.
Example Peerbelt data is provided in the
example_data directory. To try OpenLSH against this file,
settings.pyto use the in-memory database.
serial.pyspecifying the provided data file as input.
First, write a line format class for parsing your input data.
Second, if you are working with a different database than Cassandra or Datastore, you will need to write a database driver; we can help with this if required.
A command-line implementation is available in
serial.py, which may be adapted if you will be running from the command line.
A web-based implementation is available in
A map-reduce based implementation may be available soon, contact us for details.
The implementation is intended to be compact in its use of storage; we don't store intermediate results (i.e., minhash values) in the database.
To compute the similarity of two documents, we need to programmatically compare them. And we do.
You may wish to make a different trade-off, and uncomment the minhash lines in the implementation of
Some of the LSH parameters are embedded in the
If you are implementing another database driver, or other changes that would benefit others, please send a pull request.