Kishore Nallan
45f0814a7a
Fixed a bug in fuzzy search.
...
When the term_len is reached during traversal, max_cost was being compared with the wrong value.
2016-06-11 22:16:49 +05:30
Kishore Nallan
30cd057201
Split-up fuzzy lookup into separate stages.
...
1. Collect all the nodes where cost exceeds threshold.
2. Sort these nodes based on a score.
3. Perform top-k iteration to locate high scoring leaves.
This ensures that low-scoring leaves don't end up trumping leaves with a higher score (as had been observed).
2016-06-10 23:20:44 +05:30
Kishore Nallan
4face51091
Calculation of hits for a token had a bug.
...
Should use search rather than prefix lookup for finding the hits so far for the exact token.
2016-06-09 17:31:32 +05:30
Kishore Nallan
734640cd2a
Fix size calculation for unsorted append.
2016-06-09 17:28:50 +05:30
Kishore Nallan
32cd67c9d1
The ART will store the frequency count in addition to the score.
...
In certain cases, the ability to identify the most similar tokens based on the popularity of the token is useful.
2016-06-08 22:18:52 +05:30
Kishore Nallan
bb0e7aefb9
Rename score to max_score for internal node and leaf structs.
2016-06-08 11:26:52 +05:30
Kishore Nallan
5591f564c8
Sort the results vector based on score finally.
...
Required when multiple leaves of a given node are candidate token suggestions.
2016-06-08 10:23:02 +05:30
Kishore Nallan
04d02919b2
Fix memory corruption during unsorted append.
2016-06-04 19:05:47 +05:30
Kishore Nallan
c029e620d9
Clean up the match scoring logic.
...
Added more comments to illustrate what's happening.
2016-05-31 19:03:40 +05:30
Kishore Nallan
80d9f57b7b
Code clean-up.
2016-05-30 20:13:55 +05:30
Kishore Nallan
0f756efe74
Fix sorting - should be in ascending order.
2016-05-30 20:13:44 +05:30
Kishore Nallan
beba88c1da
Positional offsets are unsorted, so should be using unsorted append.
2016-05-30 20:12:04 +05:30
Kishore Nallan
3dde71e72e
Unsorted append to forarray.
2016-05-30 20:11:26 +05:30
Kishore Nallan
383212be46
Fix bugs in top-K implementation.
2016-05-15 09:01:05 +05:30
Kishore Nallan
884a83f53c
Use lower bound search to implement indexOf()
2016-05-15 09:00:42 +05:30
Kishore Nallan
10ff747802
Minor refactoring. Adding more comments.
2016-04-26 20:49:24 +05:30
Kishore Nallan
c667ed5d10
Fix static linking with libfor.
2016-04-25 21:51:02 +05:30
Kishore Nallan
f0f57f2e2d
Saving state.
2016-03-23 07:38:43 +05:30
Kishore Nallan
566c4ce666
Intersection of documents across the search tokens.
2016-02-29 19:47:05 +05:30
Kishore Nallan
47df6201b1
Append offset related fields to the art leaf during insertion.
2016-02-28 21:13:54 +05:30
Kishore Nallan
1a7350c0ec
Cartesian product of word suggestions for each query token to form search phrases.
2016-02-28 09:24:23 +05:30
Kishore Nallan
71a9c2709b
Bug fix: Wrong order of arguments when recursing.
2016-02-28 09:01:58 +05:30
Kishore Nallan
0ba5c4874f
Parameterized the number of fuzzy matches that are returned for words with typos.
2016-02-21 19:51:57 +05:30
Kishore Nallan
b88241d9e9
Bug fix: word suggestions were not showing up sorted on their document scores.
...
Somehow, std::max() on uint16_t does not seem to work. Using a MAX macro.
2016-02-21 19:21:20 +05:30
Kishore Nallan
8ff75e481d
Replace callbacks with a result vector.
...
Document IDs for the given search token will be populated into this result vector.
2016-02-20 23:14:17 +05:30
Kishore Nallan
1ffe38b5c8
Grow the forarray properly depending on the data stored.
2016-02-20 23:12:55 +05:30
Kishore Nallan
cb3b0e1a6e
Using a proper document struct when representing leaf values.
...
Removed experimental submodules. Only using `for` now (compressed array).
2016-01-31 11:20:07 +05:30
Kishore Nallan
ee77fb4d22
Add 2 more external dependencies via git submodule.
2016-01-24 14:35:40 +05:30
Kishore Nallan
22a63be16b
Add external deps via git modules.
2016-01-23 18:23:00 +05:30
Kishore Nallan
c095c166f0
Adding external dependencies.
2016-01-17 19:11:05 +05:30
Kishore Nallan
a662e43959
Top-K matches for a given substring seem to work.
2015-12-31 07:22:35 +05:30
Kishore Nallan
2dfc31a519
Sorting on popularity metric - WIP. Still has bugs.
2015-12-29 20:55:50 +05:30
Kishore Nallan
6e87b65598
Migrating ART to CPP.
2015-12-14 15:42:09 +05:30
Kishore Nallan
5246a1683d
Adding a max_score field to intermediate nodes that denotes the maximum score of the leaf nodes.
...
This is useful for pruning search space when we want to identify top-K matches for a given prefix.
2015-12-14 08:23:28 +05:30
Kishore Nallan
8f91f11cb1
Iterate index only till end of key len, without considering depth of the term length.
2015-11-30 18:04:50 +05:30
Kishore Nallan
eb1e68620a
Fixed a LEAF node issue for "amzfing" with threshold of 2.
...
Was producing too many spurious single char matches.
2015-11-29 22:13:21 +05:30
Kishore Nallan
ba39be766c
Prevent early return during recursion inside loop.
...
Fixed "amazin" with 0 threshold.
2015-11-29 21:18:48 +05:30
Kishore Nallan
f836443ad9
Fixed a bug with NODE48 traversal for exact search of "amazing".
2015-11-29 19:35:12 +05:30
Kishore Nallan
50a125f7ea
Fixed a major bug with NODE256 iteration for prefix "twili".
2015-11-29 16:36:36 +05:30
Kishore Nallan
0d1eca8229
Move duplicated code to a macro.
2015-11-29 09:08:39 +05:30
Kishore Nallan
619a3972d8
Fix another edge case involving early end of term.
2015-11-29 08:22:05 +05:30
Kishore Nallan
e4a2be3ac3
Rewriting fuzzy look-up using incremental Levenshtein matrix. WIP.
2015-11-28 22:41:26 +05:30
Kishore Nallan
b7dbec8535
More bug fixes for fuzzy match.
2015-11-26 08:01:08 +05:30
Kishore Nallan
025d3b6bce
Fix bugs in fuzzy match.
2015-11-26 07:21:01 +05:30
Kishore Nallan
64f53b6420
Initial commit. Fuzzy prefix match works.
2015-11-10 19:44:44 +05:30