# Typesense: TODO

a) ~~Fix memory ratio (decreasing with indexing)~~
b) ~~Speed up wildcard searches further~~
c) ~~Allow int64 in default sorting field~~
d) ~~Use connection timeout for CURL rather than request timeout~~
e) Async import

**Search index**

- ~~Proper JSON as input~~
- ~~Storing raw JSON input to RocksDB~~
- ~~ART for every indexed field~~
- ~~Delete should remove from RocksDB~~
- ~~Speed up UUID generation~~
- ~~Make the search score computation customizable~~
- ~~art int search should support signed ints~~
- ~~Search across multiple fields~~
- ~~Have set inside topster itself~~
- ~~Persist next_seq_id~~
- ~~collection_id should be int, not string~~
- ~~API should return count~~
- ~~Fix documents.jsonl path in tests~~
- ~~Multi field search tests~~
- ~~storage key prefix should include collection name~~
- ~~Index and search on multi-valued field~~
- ~~range search for art_int~~
- ~~Restore records as well on restart (like for meta)~~
- ~~drop collection should remove all records from the store~~
- ~~Multi-key binary search during scoring~~
- ~~Assumption that all tokens match for scoring is no longer true~~
- ~~Filters~~
- ~~Facets~~
- ~~Schema validation during insertion (missing fields + type errors)~~
- ~~Proper score field for ranking tokens~~
- ~~Throw errors when schema is broken~~
- ~~Desc/Asc ordering with tests~~
- ~~Found count is wrong~~
- ~~Filter query in the API~~
- ~~Facet limit (hardcode to top 10)~~
- ~~Deprecate old split function~~
- ~~Multiple facets not working~~
- ~~Search snippet with highlight~~
- ~~Snippet should only be around surrounding matching tokens~~
- ~~Proper pagination~~
- ~~Pagination parameter~~
- ~~Drop collection API~~
- ~~JSONP response~~
- ~~"error":"Not found." is sent when query has no hits~~
- ~~Fix API response codes~~
- ~~List all collections~~
- ~~Fetch an individual document~~
- ~~ID field should be a string: must validate~~
- ~~Number of records in collection~~
- ~~Test for asc/desc upper/lower casing~~
- ~~Test for search without any sort_by given~~
- ~~Test for collection creation validation~~
- ~~Test for delete document~~
- ~~art float search~~
- ~~When prefix=true, use default_sorting_field for token ordering only for last word~~
- ~~only last token should be prefix searched~~
- ~~Prefix-search strings should not be null terminated~~
- ~~sort results by float field~~
- ~~json::parse must be wrapped in try catch~~
- ~~Collection Manager collections map should store plain collection name~~
- ~~init_collection of Collection manager should probably take seq_id as param~~
- ~~node score should be int32, no longer uint16 like in document struct~~
- ~~Typo in prefix search~~
- ~~When field of "id" but not string, what happens?~~
- ~~test for num_documents~~
- ~~test for string filter comparison: title < "foo"~~
- ~~Test for sorted_array::indexOf when length is 0~~
- ~~Test for pagination~~
- ~~search_fields, sort_fields and facet fields should be combined~~
- ~~facet fields should be indexed verbatim~~
- ~~change "search_by" to "query_by"~~
- ~~during index_in_memory() validations should be front loaded~~
- ~~Support default sorting field being a float~~
- ~~https support~~
- ~~Validate before string to int conversion in the http api layer~~
- ~~art bool support~~
- ~~Export collection~~
- ~~get collection should show schema~~
- ~~API key should be allowed as a GET parameter also (for JSONP)~~
- ~~Don't crash when the data directory is not found~~
- ~~When the first sequence ID is not zero, bail out~~
- ~~Proper status code when sequence number to fetch is bad~~
- ~~Replica should be read-only~~
- ~~string_utils::tokenize should not have max length~~
- ~~handle hyphens (replace them)~~
- ~~clean special chars before indexing~~
- ~~Add docs/explanation around ranking calc~~
- ~~UTF-8 normalization~~
- ~~Use rocksdb batch put for atomic insertion~~
- ~~Proper logging~~
- ~~Handle store-get() not finding a key~~
- ~~Deprecate converting integer to string verbatim~~
- ~~Deprecate union type punning~~
- ~~Replica server should fail when pointed to "old" master~~
- ~~gzip compress responses~~
- ~~Have a LOG(ERROR) level~~
- ~~Handle SIGTERM which is sent when process is killed~~
- ~~Use snappy compression for storage~~
- ~~Fix exclude_scalar early returns~~
- ~~Fix result ids length during grouped overrides~~
- ~~Fix override grouping (collate_included_ids)~~
- ~~Test for overriding result on second page~~
- At least 1 token must match before proceeding with drop tokens
- support wildcard query with filters
- API for optimizing on disk storage
- Jemalloc
- Exact search
- NOT operator support
- Log operations
- Parameterize replica's MAX_UPDATES_TO_SEND
- 64K token limit
- Validation for float field values > INT32_MAX
- highlight of string arrays?
- test for token ranking on float field
- test for float/int field deletion during doc deletion
- Test for snippets
- Test for replication
- Query token ids should match query token ordering
- ID should not have "/"
- Group results by field
- Delete using range: https://github.com/facebook/rocksdb/wiki/Delete-A-Range-Of-Keys
- Test for string utils
- Prevent string copy during indexing
- Minimum results should be a variable instead of blindly going with max_results
- Handle searching for non-existing fields gracefully
- test for same match score but different primary, secondary attr
- Support nested fields via "."
- Support search operators like +, - etc.
- Space sensitivity
- Use bitmap index instead of compressed array for doc list?
- Primary_rank_scores and secondary_rank_scores hashmaps should be combined?
- d-ary heap?
- ~~topster: reject min heap value compare only when field is same~~
- ~~match index instead of match score~~

**API**

- Support the following operations:
    - ~~create a new index~~
    - ~~index a single document~~
    - ~~delete a document by ID~~
    - ~~query an index~~
    - ~~Drop an index~~
    - ~~fetch a document by ID~~

**Clustering**

- Sync every incoming write with another Typesense server

**Refactoring**

- ~~`token_count` in leaf is redundant: can be accessed from value~~
- ~~storing length in `offsets` is redundant: it can be found by looking up the value of the next index in `offset_index`~~

**Tech debt**

- ~~Use GLOB file pattern for CMake (better IDE refactoring support)~~
- DRY up the `index_int64_field*` methods