81 Commits

Author SHA1 Message Date
Kishore Nallan
080eceea79 Remove bit packing - use proper struct. 2016-09-27 20:53:38 +05:30
Kishore Nallan
1cf5eb9d9c Fix path to source directory for make. 2016-09-26 08:18:00 +05:30
Kishore Nallan
5cd8b72d0b Fixed a bug in top-K sorting. 2016-09-25 13:10:34 +05:30
Kishore Nallan
e777afc97f API for removing a document from index. 2016-09-24 18:08:57 +05:30
Kishore Nallan
9f75b70b07 Add document end-point. 2016-09-13 21:35:21 +05:30
Kishore Nallan
59f25dca39 Fix libfor repository URL - updated CMakeLists & README. 2016-09-13 18:22:46 +05:30
Kishore Nallan
e7c6c6d3cb Fixed multi word queries. 2016-09-12 14:25:07 +05:30
Kishore Nallan
2f26b95c5b Intermediate matching nodes should not be pushed to the results vector. 2016-09-11 12:13:04 +05:30
Kishore Nallan
c96a9d9b35 Adopt Damerau–Levenshtein distance, instead of plain Levenshtein.
Specifically, we use the optimal string alignment distance, which counts a transposition of two adjacent characters as a single edit (cost 1) rather than two (cost 2).

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
2016-09-10 16:19:04 +05:30
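
A minimal sketch of the optimal string alignment distance named in the commit above, for illustration only. It follows the standard textbook dynamic-programming recurrence, not the code in this repository; the function name `osa_distance` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Optimal string alignment (restricted Damerau-Levenshtein) distance.
// Hypothetical helper for illustration; not taken from the repository.
std::size_t osa_distance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();

    // d[i][j] = distance between the first i chars of a and the first j chars of b.
    std::vector<std::vector<std::size_t>> d(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;  // delete all i chars
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;  // insert all j chars

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,           // deletion
                                d[i][j - 1] + 1,           // insertion
                                d[i - 1][j - 1] + cost});  // substitution or match
            // Transposition of two adjacent characters costs 1.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[n][m];
}
```

With this recurrence, `osa_distance("ab", "ba")` is 1, whereas plain Levenshtein would give 2.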
Kishore Nallan
618a2020ed Delete the old fuzzy search implementation. 2016-09-06 17:43:25 +05:30
Kishore Nallan
c25f7ccfdb Parameterize the num_typos from the query end-point. 2016-09-05 19:13:44 +05:30
Kishore Nallan
93de59be29 Added some conditions for search space reduction that put performance back on par with the original implementation. 2016-09-05 10:58:32 +05:30
Kishore Nallan
1a53d5692e Fuzzy match rewrite - still need to work on matching perf. 2016-09-04 12:22:16 +05:30
Kishore Nallan
334ce264a5 For now, do not treat prefix matches as whole matches. 2016-09-02 10:34:34 +05:30
Kishore Nallan
aa46985bab Handle spaces in query string. 2016-08-30 22:07:17 +05:30
Kishore Nallan
44da808f16 RocksDB based persistence. 2016-08-28 22:04:58 +05:30
Kishore Nallan
1d3af330dd JSON document as input to collection.add method. 2016-08-28 09:23:30 +05:30
Kishore Nallan
2804b145dd Add OS X build instructions. 2016-08-27 22:48:01 +05:30
Kishore Nallan
4d2ba27cab Release memory of value stored when art node is destroyed. 2016-08-27 19:49:52 +05:30
Kishore Nallan
94db15b715 Fixed various issues flagged by Valgrind. 2016-08-27 13:44:53 +05:30
Kishore Nallan
1c36238f19 Fix debug flag. 2016-08-24 22:32:16 +05:30
Kishore Nallan
1e71058917 Length of char* was being calculated incorrectly.
The terminating null character needs to be counted.
2016-08-24 22:31:50 +05:30
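
For illustration, the off-by-one described in the commit above can be sketched as follows. This assumes the intent is that the byte count of a C string includes its trailing `'\0'`; the commit does not say where the length is used, and `key_len_with_terminator` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of bytes occupied by a C string,
// counting the terminating null character.
std::size_t key_len_with_terminator(const char* s) {
    // std::strlen stops before '\0', so add one to include it.
    return std::strlen(s) + 1;
}
```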
Kishore Nallan
1c09ec38a8 Removed redundant token_count field from the leaf. 2016-08-24 09:27:32 +05:30
Kishore Nallan
2a77a1ad66 Removed redundant storage of length in offsets array. 2016-08-24 08:46:02 +05:30
Kishore Nallan
c079b22cbd Fix typo in test document harness.
Added better print debugging in the process.
2016-08-23 22:37:54 +05:30
Kishore Nallan
9b6547f050 Refactor: rename index to collection. 2016-08-23 20:32:37 +05:30
Kishore Nallan
ae34ae3195 Add JSON dep. 2016-08-23 20:31:11 +05:30
Kishore Nallan
7147fa7ed5 Added design and todo docs. 2016-08-16 21:17:37 +05:30
Kishore Nallan
0eeb75b385 Boost dep is not needed. 2016-08-16 14:57:29 +05:30
Kishore Nallan
e6306ac432 Remove crow as dep. 2016-08-14 15:37:45 +05:30
Kishore Nallan
4f10586d13 Add skeleton HTTP server for serving the RESTish API. 2016-08-14 12:20:41 +05:30
Kishore Nallan
a228d153a6 Update README. 2016-08-14 12:19:35 +05:30
Kishore Nallan
a927a32018 Breaking down the long search method into smaller chunks. 2016-08-07 15:59:49 -07:00
Kishore Nallan
ba33da1d51 Lots of code clean up.
* Move stuff out of main to classes
* Standardize naming conventions.
2016-08-07 14:55:26 -07:00
Kishore Nallan
6c2974aaeb Add crow as a dep - http framework. 2016-08-07 14:54:26 -07:00
Kishore Nallan
e1f4b3d513 Constantize arguments, some clean-up code. 2016-08-05 18:26:31 -07:00
Kishore Nallan
45f0814a7a Fixed a bug in fuzzy search.
When the term_len is reached during traversal, max_cost was being compared with the wrong value.
2016-06-11 22:16:49 +05:30
Kishore Nallan
30cd057201 Split-up fuzzy lookup into separate stages.
1. Collect all the nodes where cost exceeds threshold.
2. Sort these nodes based on a score.
3. Perform top-k iteration to locate high scoring leaves.

This ensures that low-scoring leaves don't end up trumping leaves with higher scores (as was observed).
2016-06-10 23:20:44 +05:30
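
One possible reading of stages 2 and 3 in the commit above, as a rough sketch. It assumes stage 1 has already produced the candidate nodes, and that each node carries a `max_score` that upper-bounds the scores of its leaves. The types `Leaf` and `CandidateNode` and the function `top_k_leaves` are hypothetical and do not come from the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Leaf {
    std::string token;
    int score;
};

struct CandidateNode {
    int max_score;             // assumed upper bound on the scores of `leaves`
    std::vector<Leaf> leaves;
};

// Returns up to k leaves with the highest scores, best first.
std::vector<Leaf> top_k_leaves(std::vector<CandidateNode> nodes, std::size_t k) {
    std::vector<Leaf> result;
    if (k == 0) return result;

    // Stage 2: visit the most promising nodes first.
    std::sort(nodes.begin(), nodes.end(),
              [](const CandidateNode& a, const CandidateNode& b) {
                  return a.max_score > b.max_score;
              });

    // Stage 3: keep the k best leaves in a min-heap (lowest score on top).
    auto higher_score = [](const Leaf& a, const Leaf& b) { return a.score > b.score; };
    std::priority_queue<Leaf, std::vector<Leaf>, decltype(higher_score)> heap(higher_score);

    for (const CandidateNode& node : nodes) {
        // No remaining node can beat the current k-th best leaf.
        if (heap.size() == k && node.max_score <= heap.top().score) break;
        for (const Leaf& leaf : node.leaves) {
            if (heap.size() < k) {
                heap.push(leaf);
            } else if (leaf.score > heap.top().score) {
                heap.pop();
                heap.push(leaf);
            }
        }
    }

    // The heap yields leaves in ascending score order; reverse for best first.
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Sorting the nodes first is what allows the early exit: once k leaves are held and the next node's `max_score` cannot exceed the weakest of them, the remaining nodes can be skipped.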
Kishore Nallan
4face51091 Calculation of hits for a token had a bug.
Should use search rather than prefix lookup for finding the hits so far for the exact token.
2016-06-09 17:31:32 +05:30
Kishore Nallan
734640cd2a Fix size calculation for unsorted append. 2016-06-09 17:28:50 +05:30
Kishore Nallan
32cd67c9d1 The ART will store the frequency count in addition to the score.
In certain cases, the ability to identify the most similar tokens based on the popularity of the token is useful.
2016-06-08 22:18:52 +05:30
Kishore Nallan
bb0e7aefb9 Rename score to max_score for internal node and leaf structs. 2016-06-08 11:26:52 +05:30
Kishore Nallan
5591f564c8 Sort the results vector based on score finally.
Required when multiple leaves of a given node are candidate token suggestions.
2016-06-08 10:23:02 +05:30
Kishore Nallan
04d02919b2 Fix memory corruption during unsorted append. 2016-06-04 19:05:47 +05:30
Kishore Nallan
c029e620d9 Clean up the match scoring logic.
Added more comments to illustrate what's happening.
2016-05-31 19:03:40 +05:30
Kishore Nallan
80d9f57b7b Code clean-up. 2016-05-30 20:13:55 +05:30
Kishore Nallan
0f756efe74 Fix sorting - should be in ascending order. 2016-05-30 20:13:44 +05:30
Kishore Nallan
beba88c1da Positional offsets are unsorted, so should be using unsorted append. 2016-05-30 20:12:04 +05:30
Kishore Nallan
3dde71e72e Unsorted append to forarray. 2016-05-30 20:11:26 +05:30
Kishore Nallan
383212be46 Fix bugs in top-K implementation. 2016-05-15 09:01:05 +05:30