Full-text search mechanism
Lucene's API is fairly generic, and its structure resembles that of a database: table -> record -> field. Many traditional applications, files, and databases can therefore be mapped easily onto Lucene's storage structure and interfaces. Overall, you can think of Lucene as a database system that supports full-text indexing.
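To make the table -> record -> field analogy concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes; `RowMappingSketch`, `toDocument`, and `toIndex` are hypothetical names, and a `Map` simply stands in for a record made of named fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Lucene classes.
final class RowMappingSketch {
    // One "record": an ordered set of named fields built from a database row.
    static Map<String, String> toDocument(String[] columns, String[] row) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            doc.put(columns[i], row[i]);
        }
        return doc;
    }

    // One "table": the collection of all records.
    static List<Map<String, String>> toIndex(String[] columns, List<String[]> rows) {
        List<Map<String, String>> index = new ArrayList<>();
        for (String[] row : rows) {
            index.add(toDocument(columns, row));
        }
        return index;
    }
}
```

Each column name becomes a field name and each row becomes one record, which is the same shape the paragraph above describes.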
Comparing Lucene with a database
Database indexes are not designed for full-text search, so a query such as like "%keyword%" cannot use the index. Fuzzy queries with LIKE therefore hurt performance, and combining several fuzzy terms (like "%keyword1%" and like "%keyword2%" ...) hurts it even more.
For an efficient retrieval system, then, the key is to build an index-like mechanism over the data source (e.g. articles): index it, store a table of keywords alongside it, and build the mappings keyword -> article. Retrieval then only has to look up those mappings, which greatly improves the efficiency of multi-keyword queries.
The table below compares Lucene with a database.
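The keyword -> article mapping described above can be sketched as a tiny in-memory inverted index. This is only an illustration under simple assumptions (lowercased words split on non-word characters); `InvertedIndexSketch` and its methods are hypothetical names, not Lucene's API or storage format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's real API or on-disk format.
final class InvertedIndexSketch {
    // keyword -> ids of the articles that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one article: split it into lowercase words and record each mapping.
    void add(int articleId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(articleId);
            }
        }
    }

    // Multi-keyword query: intersect the article sets of all keywords.
    Set<Integer> search(String... keywords) {
        Set<Integer> result = null;
        for (String keyword : keywords) {
            Set<Integer> ids = postings.getOrDefault(keyword.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene builds a full-text index");
        index.add(2, "A database index is a B-tree");
        index.add(3, "Lucene is not a database");
        System.out.println(index.search("lucene", "database")); // prints [3]
    }
}
```

A multi-keyword query becomes a few map lookups plus a set intersection, instead of one scan of every article per keyword.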
|Aspect||Lucene full-text index||Database|
|Index||Indexes the data source with a full-text index and builds an inverted index||LIKE queries cannot use a traditional index; records are scanned with GREP-style fuzzy matching|
|Match results||Matches by term (word element); by implementing the language analysis interface, Chinese and other non-English languages can be supported||like "%org%" will also match "organization". With multiple fuzzy keywords, like "%com%org%" cannot match the reversed word order xxx.org..xxx.com|
|Degree of match||A matching algorithm ranks results by degree of match (similarity)||No control over the degree of match: a record in which the word appears five times is treated the same as one in which it appears once|
|Result output||A special algorithm returns the best matches first, and only a small number of results is returned at a time||Returns the whole matching result set, so a lot of memory is needed to hold the temporary result set|
|Customization||Index rules (including Chinese language support) can easily be customized for each application||No customization|
|Conclusion||Good for high-load fuzzy queries and large corpora||Inefficient for fuzzy queries; suitable when no fuzzy matching is needed|
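Three rows of the table (match results, degree of match, result output) can be illustrated with a short sketch. It assumes a very crude relevance measure, the raw count of a term in a text, which is only a stand-in: Lucene's real scoring is more elaborate. `TermRankingSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; raw term counts stand in for real scoring.
final class TermRankingSketch {
    // Substring test, roughly what LIKE "%needle%" does.
    static boolean likeMatch(String text, String needle) {
        return text.toLowerCase().contains(needle.toLowerCase());
    }

    // Count how many whole words of the text equal the term.
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase();
        int count = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Return at most `limit` matching texts, highest term frequency first.
    static List<String> topMatches(List<String> docs, String term, int limit) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (termFrequency(doc, term) > 0) {
                hits.add(doc);
            }
        }
        hits.sort(Comparator.comparingInt((String d) -> termFrequency(d, term)).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(limit, hits.size())));
    }
}
```

With the texts "org org org chart", "the organization grew", and "one org only", `likeMatch` accepts "the organization grew" for "org" while `termFrequency` gives it 0, and `topMatches(docs, "org", 1)` returns only the text in which the term appears most often.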
Most search (database) engines maintain their index in a B-tree structure, so every index update causes a lot of I/O. Lucene improves on this: rather than maintaining a single index file, it writes new data into small new index files and periodically merges these small files into the original large index. This improves indexing efficiency without hurting retrieval efficiency.
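The idea of writing small index files and merging them later can be sketched in memory as follows. This is a simplified illustration, not Lucene's actual file format or merge policy; `SegmentMergeSketch` and its methods are hypothetical names, and one small segment per document is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's file format or merge policy.
final class SegmentMergeSketch {
    // The large, existing index: term -> document ids.
    private final Map<String, Set<Integer>> mainIndex = new TreeMap<>();
    // Small indexes created since the last merge.
    private final List<Map<String, Set<Integer>>> smallSegments = new ArrayList<>();

    // New documents go into a fresh small segment; the large index is untouched.
    void addDocument(int docId, String text) {
        Map<String, Set<Integer>> segment = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                segment.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        smallSegments.add(segment);
    }

    // Periodic step: fold every small segment into the large index in one pass.
    void merge() {
        for (Map<String, Set<Integer>> segment : smallSegments) {
            for (Map.Entry<String, Set<Integer>> e : segment.entrySet()) {
                mainIndex.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        smallSegments.clear();
    }

    // Search consults the large index plus any segments not merged yet.
    Set<Integer> search(String term) {
        String key = term.toLowerCase();
        Set<Integer> result = new TreeSet<>(mainIndex.getOrDefault(key, Set.of()));
        for (Map<String, Set<Integer>> segment : smallSegments) {
            result.addAll(segment.getOrDefault(key, Set.of()));
        }
        return result;
    }

    int pendingSegments() {
        return smallSegments.size();
    }
}
```

Adding a document never rewrites the large index, and a search returns the same results before and after `merge()`; only the number of places it has to look changes.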
* These are my study notes on Lucene. If anything here is not quite correct, please leave a comment. Thanks.
Once you understand what Lucene is by comparing it with a database, it is easy to get started with an example.
Below is a simple example. The code itself is quite good; the explanation can be ignored since it is not in English.
<pre><code> String foo = "bar"; </code></pre>