Latent Semantic Indexing (LSI) is a common technique in natural language processing. This article explains how LSI works by comparing it with pure keyword-based search.
What is LSI?
Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. (Source: Wikipedia)
For example, when “Paris” and “Hilton” appear together they are associated with a person rather than with a city and a hotel, and “Tiger” and “Woods” together are associated with golf.
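The SVD step in the definition above can be written out briefly. The symbols here are assumptions introduced for illustration, not notation from the quoted definition: $A$ is the term-document matrix with $m$ terms (rows) and $n$ documents (columns), and $k$ is the number of latent concepts one chooses to keep.

```latex
A = U \Sigma V^{\top}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},\;
k \ll \min(m, n)
```

Roughly speaking, each row of $U_k$ places a term, and each row of $V_k$ places a document, in the same $k$-dimensional concept space, so terms that occur in similar contexts tend to end up close together.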
Regular Keyword Search vs. LSI
With a regular keyword search, a document either contains the given word or it does not; there is no middle ground.
LSI adds an important step to the document indexing process. In addition to recording which words each document contains, LSI examines the whole collection to see which other documents contain some of the same words. LSI considers documents that have many words in common to be semantically close, and ones with fewer words in common to be less close.
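As a minimal sketch of this all-or-nothing behaviour, the following uses a hypothetical three-document collection (the texts and the `keyword_search` helper are invented for illustration):

```python
# Hypothetical three-document collection, used only for illustration.
documents = {
    "d1": "how to write a java program",
    "d2": "sample java code for beginners",
    "d3": "tiger woods wins another golf title",
}

def keyword_search(query_word, docs):
    """Return ids of documents that contain query_word as a whole token."""
    word = query_word.lower()
    return [doc_id for doc_id, text in docs.items()
            if word in text.lower().split()]

print(keyword_search("program", documents))  # ['d1']
print(keyword_search("code", documents))     # ['d2']
```

A search for “program” misses `d2` even though it is about the same topic, because the match is purely on the literal token.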
When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.
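One common way to express this ranking step, under notation assumed here for illustration: let $U_k$ hold the first $k$ left singular vectors of the term-document matrix, let $a_j$ be the term-count column of document $j$, and let $q$ be the term-count vector of the query. Both are projected into the $k$-dimensional concept space and compared by cosine similarity. Other formulations rescale by the singular values, so this is a sketch of one convention rather than the only one.

```latex
\hat{q} = U_k^{\top} q, \qquad
\hat{d}_j = U_k^{\top} a_j, \qquad
\operatorname{sim}(q, d_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because the comparison happens in concept space, $\operatorname{sim}(q, d_j)$ can be high even when $q$ and $a_j$ share no term.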
An LSI Example
If we use LSI to index a collection of articles and the words “program” and “code” appear together frequently enough, the search algorithm will notice that the two terms are semantically close. A search for “program” will therefore return the articles containing that word, but also articles that contain just the word “code”. LSI has no built-in notion of what the words mean or how closely related they are, but by examining a sufficient number of documents, it learns that the two terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
The diagram below illustrates the difference between LSI and keyword search. W stands for a document.
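The “program” and “code” scenario can be sketched end to end on a toy corpus. Everything below is an assumption made for illustration: the five documents are invented, the helper names are hypothetical, and a small pure-Python power iteration stands in for the SVD routine a real system would take from a numerical library. Queries and documents are projected onto the top two left singular vectors of the term-document matrix and compared by cosine similarity.

```python
import math

# Toy corpus, invented for illustration: three programming and two golf articles.
docs = [
    "program code",
    "program code java",
    "code java",            # never mentions "program"
    "golf tiger woods",
    "golf tiger",
]
tokenized = [d.split() for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Term-document matrix A: one row per term, one column per document.
A = [[doc.count(term) for doc in tokenized] for term in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenvectors(C, k, iters=500):
    """Top-k eigenvectors of a symmetric PSD matrix (power iteration + deflation).

    The all-ones start vector is an assumption that works for this toy matrix;
    a general implementation would use a random start or a library SVD.
    """
    C = [row[:] for row in C]
    n = len(C)
    vectors = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [dot(row, v) for row in C]
            length = math.sqrt(dot(w, w))
            v = [x / length for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in C])
        vectors.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                C[i][j] -= eigenvalue * v[i] * v[j]
    return vectors

# The left singular vectors of A are the eigenvectors of A * A^T (term-by-term matrix).
term_term = [[dot(row_i, row_j) for row_j in A] for row_i in A]
U_k = top_eigenvectors(term_term, k=2)   # keep k = 2 latent concepts

def to_concept_space(term_counts):
    """Project a term-count vector onto the k latent concepts."""
    return [dot(u, term_counts) for u in U_k]

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

doc_vectors = [to_concept_space([row[j] for row in A]) for j in range(len(docs))]

def lsi_scores(query):
    """Cosine similarity between the query and every document in concept space."""
    q = to_concept_space([query.split().count(term) for term in vocab])
    return [cosine(q, d) for d in doc_vectors]

def keyword_hits(query):
    """Indices of documents that literally contain the query word."""
    return [i for i, doc in enumerate(tokenized) if query in doc]

print(keyword_hits("program"))            # [0, 1]
for text, score in zip(docs, lsi_scores("program")):
    print(f"{score:6.3f}  {text}")
```

On this corpus the keyword search finds only the first two documents, while the LSI scores for all three programming documents, including “code java”, should come out close to 1 and the golf documents close to 0. The corpus is deliberately clean; on real text the separation is much less sharp and the choice of `k` matters.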