Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a common technique in natural language processing. This article explains how LSI works by comparing it with pure keyword-based search.

What is LSI?

Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. – Wikipedia

For example, when “Paris” and “Hilton” appear together, they are associated with a person rather than with a city and a hotel. Likewise, “Tiger” and “Woods” appearing together are associated with golf.

Regular Keyword Search vs. LSI

With a regular keyword search, a document either contains the given word or it does not; there is no middle ground.
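To make the all-or-nothing behaviour concrete, here is a minimal sketch of an exact-match keyword search. The corpus, the document ids, and the function name `keyword_search` are all made up for illustration and do not describe any particular search engine.

```python
# Hypothetical toy corpus; the ids and texts are invented for this example.
docs = {
    "d1": "how to write a program in python",
    "d2": "reading source code written by others",
    "d3": "a recipe for baking bread",
}

def keyword_search(query, docs):
    """Return the ids of documents that contain the query word exactly."""
    word = query.lower()
    return [doc_id for doc_id, text in docs.items()
            if word in text.lower().split()]

print(keyword_search("program", docs))   # ['d1']
print(keyword_search("software", docs))  # []
```

Document d2 is clearly about programming, but it is never returned for “program” because the word itself does not occur in it.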

LSI adds an important step to the document indexing process. In addition to recording which words each document contains, LSI examines the collection as a whole to see which other documents contain some of the same words. LSI considers documents that have many words in common to be semantically close, and ones with fewer words in common to be less close.
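The idea that documents sharing many words are close can be sketched with cosine similarity over raw word counts. This is a deliberate simplification and not full LSI, which additionally applies SVD to the whole collection. The function name and the three sample texts are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
doc_a = "debugging program code"
doc_b = "testing program code"
doc_c = "baking bread recipe"

print(cosine_similarity(doc_a, doc_b))  # about 0.67: two of three words shared
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words shared
```

Unlike the yes/no answer of a keyword match, the score is graded: the more words two documents share, the closer it gets to 1.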

When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the keyword at all.
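For readers who want the similarity values written out, the usual textbook formulation of LSI is sketched below. The notation is an assumption introduced here, not something taken from the sources of this article: A is the term-document count matrix, k is the number of latent concepts kept, q is the query written as a term-count vector, and \hat{d}_j is the j-th row of V_k.

```latex
% Rank-k approximation of the term-document matrix via truncated SVD
A \approx A_k = U_k \Sigma_k V_k^{T}

% Fold the query into the same k-dimensional concept space as the documents
\hat{q} = \Sigma_k^{-1} U_k^{T} q

% Rank document j by cosine similarity in the concept space
\operatorname{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because the comparison happens in the concept space rather than over individual words, a document can score highly even when it shares no literal keyword with the query.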

An LSI Example

If we use LSI to index a collection of articles and the words “program” and “code” appear together frequently enough, the search algorithm will notice that the two terms are semantically close. A search for “program” will therefore return the articles containing that word, but also articles that contain only the word “code”. LSI does not understand what the words mean, but by examining a sufficient number of documents, it learns that the two terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
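The “program” and “code” scenario can be reproduced on a tiny scale with NumPy's `np.linalg.svd`. This is only a sketch under stated assumptions: the five-document corpus, the four-word vocabulary, the choice k = 2, and the helper names `cosine` and `lsi_search` are all invented for illustration, and a real system would typically weight the counts and use far more data.

```python
import numpy as np

# Hypothetical toy vocabulary and corpus, invented for this example.
terms = ["program", "code", "recipe", "oven"]
docs = [
    "program code",  # doc 0
    "program code",  # doc 1
    "code",          # doc 2: never mentions "program"
    "recipe oven",   # doc 3
    "recipe oven",   # doc 4
]

# Term-document count matrix: one row per term, one column per document.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms],
             dtype=float)

# Truncated SVD: keep only the k strongest latent concepts.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Term vectors in concept space: rows of U_k scaled by the singular values.
term_vecs = U_k * s_k
sim_program_code = cosine(term_vecs[0], term_vecs[1])
sim_program_recipe = cosine(term_vecs[0], term_vecs[2])

def lsi_search(query):
    """Score every document against the query in concept space."""
    q = np.array([query.split().count(t) for t in terms], dtype=float)
    q_k = (q @ U_k) / s_k   # fold the query into concept space
    doc_vecs = Vt_k.T       # one row per document
    return [cosine(q_k, d) for d in doc_vecs]

scores = lsi_search("program")
print(round(sim_program_code, 3))      # close to 1: the terms are related
print(round(abs(scores[2]), 3))        # doc 2 scores close to 1
print(round(abs(scores[3]), 3))        # doc 3 scores close to 0
```

Doc 2 contains only “code”, so an exact keyword match on “program” would miss it, yet it scores as highly as the documents that do contain “program”. The cooking documents stay near zero because they load on a different concept.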

The diagram below illustrates the difference between LSI and keyword search. W stands for a document.

[Diagram: Latent Semantic Indexing]


