36 Commits

Author SHA1 Message Date
Kishore Nallan
c3298ba6d8 Address -Wall and -Wextra warnings. 2018-01-25 20:08:13 +05:30
Kishore Nallan
78b9ee52ec Make match score computation predictable and consistent across multiple indexes. 2017-11-12 22:31:29 +05:30
Kishore Nallan
e24e0fae5d Node score should be an int32_t. 2017-09-21 19:40:41 +05:30
Kishore Nallan
a2f475d7fc Enable ART to index and search on floating point numbers. 2017-08-09 18:17:26 -04:00
Kishore Nallan
b7bc974b8e Expose token ranking field properly via the API. 2017-05-27 14:02:32 +05:30
Kishore Nallan
4776b41dc1 Facet implementation. 2017-03-13 21:09:27 +05:30
Kishore Nallan
b880cfd531 Refactor forarray - split into individual classes. 2017-02-04 16:27:07 +05:30
Kishore Nallan
da68fb17e8 Support LESS_THAN and GREATER_THAN. 2017-01-22 05:40:10 +05:30
Kishore Nallan
0fcdb6b479 Support signed ints in art int search. 2017-01-12 21:20:52 +05:30
Kishore Nallan
0b88e669f6 Make ART fuzzy_search take min_cost and max_cost instead of only max_cost. 2016-12-28 18:16:43 +05:30
Kishore Nallan
12276b651f Base work for supporting multiple indexable fields. 2016-12-22 22:26:33 +05:30
Kishore Nallan
9b0c347334 ART - integer range search. 2016-12-11 13:47:43 +05:30
Kishore Nallan
9cc3e7e5ea Fixed a bug in pagination. 2016-11-27 21:30:13 +05:30
Kishore Nallan
e1526319f7 Building up support for prefix based searching and for ranking token suggestions by either frequency or max_score. 2016-11-27 14:56:15 +05:30
Kishore Nallan
aab5912110 Fuzzy search tests. 2016-11-07 19:36:28 +05:30
Kishore Nallan
ef105dcbd9 Reduce memory footprint. 2016-10-04 21:31:55 +05:30
Kishore Nallan
c96a9d9b35 Adopt Damerau–Levenshtein distance, instead of plain Levenshtein.
Specifically, we use the optimal string alignment distance. It treats transposition as a cost of 1, rather than 2.

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
2016-09-10 16:19:04 +05:30
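
Note on c96a9d9b35: the difference between plain Levenshtein and the optimal string alignment (OSA) distance can be illustrated with a small dynamic-programming sketch. This is not the code from the commit; the function name `osa_distance` and the full-matrix layout are assumptions chosen for clarity, not for performance.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the repository): optimal string alignment
// distance, i.e. Levenshtein plus adjacent transposition at a cost of 1.
std::size_t osa_distance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    // d[i][j] = distance between the first i chars of a and first j chars of b.
    std::vector<std::vector<std::size_t>> d(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;  // delete all i chars
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;  // insert all j chars

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                d[i][j - 1] + 1,          // insertion
                                d[i - 1][j - 1] + cost}); // substitution or match
            // Adjacent transposition: "ab" -> "ba" costs 1 instead of 2.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[n][m];
}
```

With this sketch, `osa_distance("ab", "ba")` is 1, where plain Levenshtein would give 2. OSA never edits the same substring twice, so it can differ from the unrestricted Damerau–Levenshtein distance on inputs such as "ca" and "abc".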
Kishore Nallan
1a53d5692e Fuzzy match rewrite - still need to work on matching perf. 2016-09-04 12:22:16 +05:30
Kishore Nallan
334ce264a5 For now, disable prefix matches to be considered as whole matches. 2016-09-02 10:34:34 +05:30
Kishore Nallan
1c09ec38a8 Removed redundant token_count field from the leaf. 2016-08-24 09:27:32 +05:30
Kishore Nallan
30cd057201 Split-up fuzzy lookup into separate stages.
1. Collect all the nodes where cost exceeds threshold.
2. Sort these nodes based on a score.
3. Perform top-k iteration to locate high scoring leaves.

This ensures that low-scoring leaves don't end up trumping leaves with a higher score (as was noticed).
2016-06-10 23:20:44 +05:30
Kishore Nallan
4face51091 Calculation of hits for a token had a bug.
Should use search rather than prefix lookup for finding the hits so far for the exact token.
2016-06-09 17:31:32 +05:30
Kishore Nallan
32cd67c9d1 The ART will store the frequency count in addition to the score.
In certain cases, the ability to identify the most similar tokens based on the popularity of the token is useful.
2016-06-08 22:18:52 +05:30
Kishore Nallan
bb0e7aefb9 Rename score to max_score for internal node and leaf structs. 2016-06-08 11:26:52 +05:30
Kishore Nallan
47df6201b1 Append offset related fields to the art leaf during insertion. 2016-02-28 21:13:54 +05:30
Kishore Nallan
0ba5c4874f Parameterized the number of fuzzy matches that are returned for words with typos. 2016-02-21 19:51:57 +05:30
Kishore Nallan
b88241d9e9 Bug fix: word suggestions were not showing up sorted on their document scores.
Somehow, std::max() on uint16_t does not seem to work. Using a MAX macro.
2016-02-21 19:21:20 +05:30
Kishore Nallan
8ff75e481d Replace callbacks with a result vector.
Document IDs for the given search token will be populated into this result vector.
2016-02-20 23:14:17 +05:30
Kishore Nallan
cb3b0e1a6e Using a proper document struct when representing leaf values.
Removed experimental submodules. Only using `for` now (compressed array).
2016-01-31 11:20:07 +05:30
Kishore Nallan
ee77fb4d22 Add 2 more external dependencies via git submodule. 2016-01-24 14:35:40 +05:30
Kishore Nallan
a662e43959 Top-K matches for a given substring seem to work. 2015-12-31 07:22:35 +05:30
Kishore Nallan
2dfc31a519 Sorting on popularity metric - WIP. Still has bugs. 2015-12-29 20:55:50 +05:30
Kishore Nallan
5246a1683d Adding a max_score field to intermediate nodes that denotes the maximum score of leaf nodes.
This is useful for pruning search space when we want to identify top-K matches for a given prefix.
2015-12-14 08:23:28 +05:30
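
Note on 5246a1683d: the pruning idea can be sketched on a generic tree rather than the actual ART node layout. Everything here (`Node`, `make_leaf`, `make_internal`, `top_k`) is a hypothetical illustration, assuming each internal node caches the maximum score among its descendant leaves and that a node without children is a leaf.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type: a leaf stores its own score in max_score,
// an internal node stores the maximum score of the leaves below it.
struct Node {
    std::string key;             // only meaningful for leaves
    std::int32_t max_score = 0;
    std::vector<Node> children;  // empty for leaves
};

Node make_leaf(std::string key, std::int32_t score) {
    Node n;
    n.key = std::move(key);
    n.max_score = score;
    return n;
}

Node make_internal(std::vector<Node> children) {
    Node n;
    n.max_score = std::numeric_limits<std::int32_t>::min();
    for (const Node& c : children) n.max_score = std::max(n.max_score, c.max_score);
    n.children = std::move(children);
    return n;
}

// Best-first traversal: always expand the node with the highest max_score.
// Subtrees whose max_score is too low are never expanded once k leaves are found.
std::vector<std::string> top_k(const Node& root, std::size_t k) {
    auto lower = [](const Node* a, const Node* b) { return a->max_score < b->max_score; };
    std::priority_queue<const Node*, std::vector<const Node*>, decltype(lower)> frontier(lower);
    frontier.push(&root);

    std::vector<std::string> out;
    while (!frontier.empty() && out.size() < k) {
        const Node* n = frontier.top();
        frontier.pop();
        if (n->children.empty()) {
            out.push_back(n->key);  // leaves come out in descending score order
        } else {
            for (const Node& c : n->children) frontier.push(&c);
        }
    }
    return out;
}
```

Because an internal node's `max_score` is an upper bound on every leaf beneath it, a leaf popped from the queue cannot be beaten by anything still waiting in the frontier, so the traversal can stop as soon as k leaves have been collected.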
Kishore Nallan
50a125f7ea Fixed a major bug with NODE256 iteration for prefix "twili". 2015-11-29 16:36:36 +05:30
Kishore Nallan
e4a2be3ac3 Rewriting fuzzy look-up using incremental levenshtein matrix. WIP. 2015-11-28 22:41:26 +05:30
Kishore Nallan
64f53b6420 Initial commit. Fuzzy prefix match works. 2015-11-10 19:44:44 +05:30