Kishore Nallan
aab5912110
Fuzzy search tests.
2016-11-07 19:36:28 +05:30
Kishore Nallan
ef105dcbd9
Reduce memory footprint.
2016-10-04 21:31:55 +05:30
Kishore Nallan
c96a9d9b35
Adopt Damerau–Levenshtein distance, instead of plain Levenshtein.
...
Specifically, we use the optimal string alignment distance. It counts a transposition of two adjacent characters as a cost of 1, rather than 2.
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
2016-09-10 16:19:04 +05:30
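The optimal string alignment distance named in the commit above can be sketched as follows. This is a minimal textbook version, not the code from this commit; the function name `osa_distance` and the full-matrix layout are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Optimal string alignment (restricted Damerau-Levenshtein) distance.
// Hypothetical helper for illustration; not taken from the repository.
std::size_t osa_distance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    // d[i][j] = distance between the first i chars of a and the first j chars of b.
    std::vector<std::vector<std::size_t>> d(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                d[i][j - 1] + 1,          // insertion
                                d[i - 1][j - 1] + cost}); // substitution or match
            // Adjacent transposition costs 1 instead of 2 substitutions.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[n][m];
}
```

With this sketch, "ab" and "ba" are at distance 1, whereas plain Levenshtein would report 2.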
Kishore Nallan
1a53d5692e
Fuzzy match rewrite - still need to work on matching perf.
2016-09-04 12:22:16 +05:30
Kishore Nallan
334ce264a5
For now, stop treating prefix matches as whole matches.
2016-09-02 10:34:34 +05:30
Kishore Nallan
1c09ec38a8
Removed redundant token_count field from the leaf.
2016-08-24 09:27:32 +05:30
Kishore Nallan
30cd057201
Split fuzzy lookup into separate stages.
...
1. Collect all the nodes where cost exceeds threshold.
2. Sort these nodes based on a score.
3. Perform top-k iteration to locate high scoring leaves.
This ensures that low-scoring leaves don't end up trumping higher-scoring leaves, as was previously observed.
2016-06-10 23:20:44 +05:30
Kishore Nallan
4face51091
Calculation of hits for a token had a bug.
...
Use an exact search rather than a prefix lookup to find the hits so far for the exact token.
2016-06-09 17:31:32 +05:30
Kishore Nallan
32cd67c9d1
The ART will store the frequency count in addition to the score.
...
In certain cases, the ability to identify the most similar tokens based on the popularity of the token is useful.
2016-06-08 22:18:52 +05:30
Kishore Nallan
bb0e7aefb9
Rename `score` to `max_score` for internal node and leaf structs.
2016-06-08 11:26:52 +05:30
Kishore Nallan
47df6201b1
Append offset related fields to the art leaf during insertion.
2016-02-28 21:13:54 +05:30
Kishore Nallan
0ba5c4874f
Parameterized the number of fuzzy matches that are returned for words with typos.
2016-02-21 19:51:57 +05:30
Kishore Nallan
b88241d9e9
Bug fix: word suggestions were not showing up sorted on their document scores.
...
Somehow, std::max() on uint16_t does not seem to work. Using a MAX macro.
2016-02-21 19:21:20 +05:30
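The commit above does not say why std::max() failed. One common cause (an assumption here, not a diagnosis of this commit) is that std::max is a template requiring both arguments to have the same type, while arithmetic on uint16_t promotes to int. A minimal sketch, using a hypothetical helper named `larger_score`:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper for illustration; not taken from the repository.
std::uint16_t larger_score(std::uint16_t current, std::uint16_t candidate) {
    // Naming the template argument forces both operands to uint16_t,
    // so a promoted int operand no longer breaks type deduction.
    return std::max<std::uint16_t>(current, candidate);
}

// For comparison, this would fail to compile, because `candidate + 1`
// has type int and T cannot be deduced from (uint16_t, int):
//
//     std::max(current, candidate + 1);
```

Spelling out the template argument keeps the type safety that a MAX macro gives up, at the cost of a narrowing conversion when an operand really is wider than 16 bits.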
Kishore Nallan
8ff75e481d
Replace callbacks with a result vector.
...
Document IDs for the given search token will be populated into this result vector.
2016-02-20 23:14:17 +05:30
Kishore Nallan
cb3b0e1a6e
Using a proper document struct when representing leaf values.
...
Removed experimental submodules. Only using `for` now (compressed array).
2016-01-31 11:20:07 +05:30
Kishore Nallan
ee77fb4d22
Add 2 more external dependencies via git submodule.
2016-01-24 14:35:40 +05:30
Kishore Nallan
a662e43959
Top-K matches for a given substring seem to work.
2015-12-31 07:22:35 +05:30
Kishore Nallan
2dfc31a519
Sorting on popularity metric - WIP. Still has bugs.
2015-12-29 20:55:50 +05:30
Kishore Nallan
5246a1683d
Adding a max_score field to intermediate nodes that denotes the maximum score of the leaf nodes under them.
...
This is useful for pruning search space when we want to identify top-K matches for a given prefix.
2015-12-14 08:23:28 +05:30
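The pruning idea in the commit above can be sketched with a best-first traversal: because an intermediate node's max_score bounds every leaf beneath it, subtrees can be visited in descending max_score order and the search can stop after K leaves. The `Node` struct and the `leaf`, `internal`, and `top_k` helpers below are hypothetical stand-ins, not the ART structures from this repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Simplified tree node (assumption): a node with no children is a leaf.
struct Node {
    std::string key;            // meaningful for leaves only
    std::uint32_t max_score;    // leaf: own score; internal: max over descendants
    std::vector<Node> children;
    bool is_leaf() const { return children.empty(); }
};

Node leaf(std::string key, std::uint32_t score) {
    return Node{std::move(key), score, {}};
}

Node internal(std::vector<Node> children) {
    std::uint32_t best = 0;
    for (const Node& c : children) best = std::max(best, c.max_score);
    return Node{"", best, std::move(children)};
}

// Returns up to k leaf keys in descending score order.
std::vector<std::string> top_k(const Node& root, std::size_t k) {
    auto lower = [](const Node* a, const Node* b) { return a->max_score < b->max_score; };
    std::priority_queue<const Node*, std::vector<const Node*>, decltype(lower)> frontier(lower);
    frontier.push(&root);

    std::vector<std::string> out;
    while (!frontier.empty() && out.size() < k) {
        const Node* n = frontier.top();
        frontier.pop();
        if (n->is_leaf()) {
            // No unexplored subtree can hold a higher score than this leaf.
            out.push_back(n->key);
            continue;
        }
        for (const Node& c : n->children) frontier.push(&c);
    }
    return out;
}
```

Subtrees whose max_score is lower than the K-th best leaf found so far are never expanded, which is where the saving comes from.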
Kishore Nallan
50a125f7ea
Fixed a major bug with NODE256 iteration for prefix "twili".
2015-11-29 16:36:36 +05:30
Kishore Nallan
e4a2be3ac3
Rewriting fuzzy lookup using an incremental Levenshtein matrix. WIP.
2015-11-28 22:41:26 +05:30
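An incremental Levenshtein matrix, as mentioned in the commit above, is usually built one row per character consumed along a tree path, so a row can be reused by every child of a node. The sketch below shows that general technique under that assumption; the names `first_row`, `next_row`, `can_prune`, and `distance_via_rows` are hypothetical and not taken from this repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Row for the empty candidate prefix: the cost of inserting j query characters.
std::vector<int> first_row(const std::string& query) {
    std::vector<int> row(query.size() + 1);
    for (std::size_t j = 0; j < row.size(); ++j) row[j] = static_cast<int>(j);
    return row;
}

// Extends the matrix by one candidate character `c`.
// Precondition: prev.size() == query.size() + 1.
std::vector<int> next_row(const std::vector<int>& prev, const std::string& query, char c) {
    std::vector<int> row(prev.size());
    row[0] = prev[0] + 1;
    for (std::size_t j = 1; j < prev.size(); ++j) {
        const int cost = (query[j - 1] == c) ? 0 : 1;
        row[j] = std::min({row[j - 1] + 1,       // insertion
                           prev[j] + 1,          // deletion
                           prev[j - 1] + cost}); // substitution or match
    }
    return row;
}

// A subtree can be skipped once every cell in the row exceeds the budget,
// because later rows can never drop below the current row's minimum.
bool can_prune(const std::vector<int>& row, int max_cost) {
    return *std::min_element(row.begin(), row.end()) > max_cost;
}

// Convenience wrapper: full distance obtained by feeding the candidate char by char.
int distance_via_rows(const std::string& query, const std::string& candidate) {
    std::vector<int> row = first_row(query);
    for (char c : candidate) row = next_row(row, query, c);
    return row.back();
}
```

In a tree traversal, each node would keep the row computed for its path and hand it to its children, so shared prefixes are only scored once.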
Kishore Nallan
64f53b6420
Initial commit. Fuzzy prefix match works.
2015-11-10 19:44:44 +05:30