Some basics about the spell correction problem
preDict is based on the spell correction fuzzy search library SymSpell with a few customizations and optimization:
The fundamental beauty of SymSpell is the Symmetric Delete spelling correction algorithm which reduces the complexity of edit candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
Additionally only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input phrase are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
Our main goal was to increase accuracy whilst maintaining the increabible speed by adding:
Run on Windows10 with a Intel(R) Core(TM) i7-6700 CPU (2.60GHz) with Java(TM) 1.8.0_121
Benchmark Mode Cnt Score Error Units SearchHub with PreDict EE * thrpt 200 96,019 ± 1,007 ops/ms PreDict CE thrpt 200 82,116 ± 1,149 ops/ms Original SymSpell (Port) thrpt 200 68,105 ± 0,977 ops/ms Lucene (FST Based FuzzySuggester) thrpt 5 17,588 ± 0,690 ops/ms Lucene (Fuzzy Field-Search) thrpt 200 0,749 ± 0,017 ops/ms
Based on data we collected for a few months. The test data is attached to the comparison project and can be changed. Changes to the data will, of course, change the results, but the differences shouldn't be that dramatical.
Benchmark Accuracy TP TN Fail-Rate FN FP SearchHub with PreDict EE * 98,87% 7452 355 1,13% 52 37 PreDict CE 90,04% 6937 320 9,96% 496 307 Original SymSpell (Port) 88,87% 6842 321 11,13% 591 306 Lucene (Fuzzy Field-Search) 88,87% 6803 360 11,13% 630 267 Lucene (FST based FuzzySuggester) 78,96% 5883 481 21,04% 1550 146
*SearchHub with PreDict EE represents our commercial offering https://www.searchhub.io This offering is a search platform independent, AI-powered search query intelligence API containing our concept of controlled precision reduction and PreDict EE (Enterprise Edition) which is capable of handling language agnostic term decomposition and disambiguation.