Lucene vs. Database Search

Full-text search mechanism

Lucene’s API is fairly generic and resembles the structure of a database: table -> record -> field. Many traditional applications, files, and databases can therefore be mapped easily onto Lucene’s storage structure and interfaces. Overall, you can think of Lucene as a database system that supports full-text indexing.

 

Compare Lucene and the database


Because a database index is not designed for full-text search, it cannot be used by a query such as LIKE '%keyword%'. A fuzzy LIKE query therefore hurts performance. If you need to match more than one fuzzy keyword, e.g. LIKE '%keyword1%' AND LIKE '%keyword2%' ..., efficiency degrades even further.
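As a rough illustration of the point above, the sketch below asks SQLite (chosen only because it ships with Python; the notes do not name a specific database) how it plans an exact lookup and a leading-wildcard LIKE. The table `articles` and the index `idx_title` are hypothetical names invented for this example, and the exact plan wording varies between SQLite versions, so only the presence of an index in the plan is checked.

```python
import sqlite3

# Hypothetical table and index names, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_title ON articles (title)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("lucene intro", "a"), ("database search", "b"), ("full-text index", "c")],
)


def query_plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# An exact match on the indexed column can use the index.
exact_plan = query_plan("SELECT body FROM articles WHERE title = 'lucene intro'")
# A pattern with a leading wildcard cannot, so the table is scanned.
fuzzy_plan = query_plan("SELECT body FROM articles WHERE title LIKE '%search%'")

print(exact_plan)
print(fuzzy_plan)
```

On a three-row table the difference is invisible in practice, but the plan shows why the cost of the LIKE query grows with the number of records.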

Therefore, the key to an efficient retrieval system is an index-like mechanism: index the data source (e.g. articles) in advance and store a keyword table alongside it, building a mapping from each keyword to the articles that contain it (keyword -> article). Retrieval then only has to look up these mappings, which greatly improves the efficiency of multi-keyword queries.
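The keyword -> article mapping described above can be sketched in a few lines. This is a minimal toy, not Lucene's implementation: it assumes whitespace tokenization and lowercase matching, and the function names `build_index` and `search_all` are made up for the example.

```python
from collections import defaultdict


def build_index(articles):
    """Map each keyword to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.lower().split():
            index[word].add(article_id)
    return index


def search_all(index, keywords):
    """Return ids of articles containing every keyword (set intersection)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


articles = {
    1: "lucene builds a full-text index",
    2: "a database index uses a b-tree",
    3: "lucene is not a database",
}
index = build_index(articles)
print(sorted(search_all(index, ["lucene", "index"])))     # [1]
print(sorted(search_all(index, ["database", "lucene"])))  # [3]
```

A multi-keyword query becomes one dictionary lookup per keyword plus a set intersection, and the order of the keywords does not matter.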

The table below compares Lucene with a database.

| Aspect | Lucene full-text indexing | Database |
|---|---|---|
| Index | Indexes the data source with a full-text index, building an inverted index. | A LIKE query cannot use a traditional index; records are matched with GREP-style fuzzy matching. |
| Match results | Matches by term (word element). By implementing the language analysis interface, Chinese and other non-English languages can be supported. | LIKE '%org%' will also match "organization". With multiple fuzzy keywords, LIKE '%com%org%' cannot match when the word order is reversed: xxx.org ... xxx.com. |
| Degree of match | A matching algorithm scores each result by its degree of match (similarity). | No control over the degree of match: a record in which the word appears five times counts the same as one in which it appears once. |
| Result output | A special algorithm returns the best matches first, and only a small number of results are returned at a time. | Returns the entire matching result set, so a lot of memory is needed to hold the temporary result set. |
| Customization | Index rules can easily be customized for the application (including Chinese language support). | No customization. |
| Conclusion | Good for high-load fuzzy queries and large corpora. | Inefficient for fuzzy queries; good for queries without fuzzy matching. |
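Three rows of the comparison (match results, degree of match, and result output) can be made concrete with a small sketch. The sample documents are invented, and the raw term count used for scoring is only a stand-in assumed for illustration; Lucene's actual similarity scoring is more elaborate.

```python
import heapq
from collections import Counter

# Invented sample documents for the illustration.
docs = {
    "a": "the org chart of the org",
    "b": "our organization is large",
    "c": "org",
}


def substring_matches(keyword):
    """Roughly what LIKE '%keyword%' does: a plain substring test, no score."""
    return sorted(doc_id for doc_id, text in docs.items() if keyword in text)


def term_matches(term, limit=2):
    """Match whole terms only and return the best `limit` hits, scored by count."""
    scores = {}
    for doc_id, text in docs.items():
        count = Counter(text.split())[term]
        if count:
            scores[doc_id] = count
    # Highest count first; only `limit` results are kept.
    return heapq.nlargest(limit, scores.items(), key=lambda item: (item[1], item[0]))


print(substring_matches("org"))  # ['a', 'b', 'c']
print(term_matches("org"))       # [('a', 2), ('c', 1)]
```

The substring test also returns document "b" because "organization" contains "org", and it treats every hit as equal. The term-based version skips "b" and ranks "a" ahead of "c" because the term occurs twice there.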

Lucene’s innovations

Most search (database) engines maintain their index in a B-tree structure, which can cause a lot of I/O operations when the index is updated. Lucene improves on this: rather than maintaining a single index file, it writes new small index files as documents are added and periodically merges them into the original large index. This improves indexing efficiency without hurting retrieval efficiency.
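The idea of writing small indexes and merging them later can be sketched as follows. This is a toy in-memory model under stated assumptions, not Lucene's merge algorithm: the class `SegmentedIndex` and its `merge_factor` parameter are hypothetical, and real segments live on disk.

```python
from collections import defaultdict


class SegmentedIndex:
    """Toy index: new documents go into small segments that are merged later."""

    def __init__(self, merge_factor=3):
        self.merge_factor = merge_factor  # hypothetical knob: small segments allowed before a merge
        self.main = defaultdict(set)      # stands in for the "original large index"
        self.small = []                   # recently written small segments
        self.merges = 0

    def add_document(self, doc_id, text):
        # Writing a new small segment never touches the large index.
        segment = defaultdict(set)
        for word in text.lower().split():
            segment[word].add(doc_id)
        self.small.append(segment)
        if len(self.small) >= self.merge_factor:
            self.merge()

    def merge(self):
        # Fold all small segments into the large index in one pass.
        for segment in self.small:
            for word, ids in segment.items():
                self.main[word] |= ids
        self.small = []
        self.merges += 1

    def search(self, word):
        # A query consults the large index plus any unmerged segments.
        word = word.lower()
        hits = set(self.main.get(word, set()))
        for segment in self.small:
            hits |= segment.get(word, set())
        return hits


idx = SegmentedIndex(merge_factor=3)
texts = ["lucene index", "b-tree index", "lucene merge", "small segment"]
for doc_id, text in enumerate(texts, start=1):
    idx.add_document(doc_id, text)

print(sorted(idx.search("lucene")))  # [1, 3]
print(idx.merges, len(idx.small))    # 1 1
```

Each added document costs one small write, and the large index is rewritten only once per `merge_factor` additions, while searches still see every document.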

* These are my study notes on Lucene. If anything here is not quite correct, please leave a comment. Thanks.

Once you understand what Lucene is by comparing it with a database, it is easy to get started with an example.
Below is a simple example. The code is quite good, and the explanation can be skipped since it is not in English.
Address: http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/

2 thoughts on “Lucene vs. Database Search”

  1. If you can present more details on how the Lucene index works, that would be great. Personally I’m comfortable with my understanding of b*tree in terms of things like: adding an entry, deleting an entry, splitting a block, scanning, range scan and all that stuff. And more stuff, reading an index block, the kinds of searches that are helped by b*tree, and *not* helped by b*tree. I would love to get to a similar place with Lucene indexing.

  2. Thanks for your article!!
    I’m wondering why the B-tree causes so many I/O operations, and the part about Lucene adding small indexes to the original large index is also unclear. Could you please give a further explanation? 😀
