From 7147fa7ed54f0b008e4cede0bae33071e850770a Mon Sep 17 00:00:00 2001
From: Kishore Nallan
Date: Tue, 16 Aug 2016 21:17:37 +0530
Subject: [PATCH] Added design and todo docs.

---
 DESIGN.md | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 README.md |  3 +--
 TODO.md   | 27 +++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 2 deletions(-)
 create mode 100644 DESIGN.md
 create mode 100644 TODO.md

diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 00000000..c943a7b6
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,46 @@
+# Typesense: Design
+
+## Motivation
+
+Typesense's design is motivated by the following considerations:
+
+- **Simplicity:** Typesense has to be super simple to set-up and get started with. The default configuration
+*should just work* for the common search use cases.
+- **Typo-tolerance out-of-the-box:** Currently, it's not at all easy to build a typo-tolerant search using existing
+systems without a considerable speed/memory penalty. This has to change, given how common typographic errors are
+in the real-world.
+- **In-memory:** All primary data structures would be in-memory, with the disk being used only for durability and for
+fetching large, unindexed fields.
+- **Optimize for reads over writes:** A typical search engine is written once and read many times. The system should be
+cognizant of this read/write pattern.
+- **Fast, without sacrificing relevance:** While speed is important, one cannot compromise on the quality of results
+returned. Remember that the average reaction time for humans is 200ms to a visual stimulus.
+- **Laser focused on search:** While there might be some overlap with what a relational database does, strive to focus
+primarily on search, instead of becoming a generalized data store with a complex query language.
+- **Availability over consistency**: In the event of a partition failure, it's better to give slightly stale search
+results, than being unavailable.
This is alright, given the inherent asynchronous nature of the indexing process.
+
+## Overview
+
+- At the heart of Typesense is a `token => documents` inverted index backed by an
+[Adaptive Radix Tree](https://db.in.tum.de/~leis/papers/ART.pdf), which is a memory-efficient implementation of the
+Trie data structure. ART allows us to do fast fuzzy searches on a query.
+- Typesense consumes JSON documents as input. Fields that should be indexed must be specified via a configuration file
+  or through the API.
+- The raw JSON documents are stored on disk using RocksDB. SSD disks are highly recommended.
+- Search results are ranked on the following factors:
+  - Number of matching tokens
+  - Proximity of search tokens within the documents that contain these tokens
+  - User specified static score for a document (e.g. the number of followers could be a static score for a
+    Twitter user)
+- A typical search query involves:
+  - a search term (required - wild card `*` search is not allowed)
+  - filter fields (optional)
+  - facet fields (optional)
+  - sort fields (optional)
+  - page
+  - limit
+- Typesense is exposed through a RESTful API, so that it can be consumed directly by web apps via AJAX requests.
+- High Availability is achieved using Master-Master replication. Every write to Typesense would be written and
+  acknowledged by another node before the write is deemed as a success. Clients can round-robin both reads and
+  writes across both the nodes.
\ No newline at end of file
diff --git a/README.md b/README.md
index c0cb6be1..2efbe7ae 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Typesense
 
-Typesense is an open source search engine for building delightful search experiences.
+Typesense is an open source search engine for building a delightful search experience.
 
 - **Typo tolerance:** Handles typographical errors out-of-the-box
 - **Tunable ranking + relevancy:** Tailor your search results to perfection
@@ -16,7 +16,6 @@ TODO
 * [libfor](https://github.com/cruppstahl/for/)
 * [h2o](https://github.com/h2o/h2o)
 * OpenSSL
-* Boost
 
 ## Building `libfor`
 
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 00000000..1a93943e
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,27 @@
+# Typesense: TODO
+
+## Pre-alpha
+
+**Search index**
+
+- Proper JSON as input
+- Storing raw JSON input to RocksDB
+- ART for every indexed field
+- UTF-8 support for fuzzy search
+- Facets
+- Filters
+- Support search operators like +, - etc.
+
+**API**
+
+- Support the following operations:
+  - create a new index
+  - index a single document
+  - bulk insert multiple documents
+  - fetch a document by ID
+  - delete a document by ID
+  - query an index
+
+**Clustering**
+
+- Sync every incoming write with another Typesense server
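
Reviewer note: the `token => documents` inverted index and the ranking factors listed in DESIGN.md's Overview can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not Typesense's implementation: the names `TinyIndex` and `edit_distance` are hypothetical, a plain dict stands in for the Adaptive Radix Tree, typo tolerance is approximated with a brute-force Levenshtein scan, and the proximity factor is left out.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


class TinyIndex:
    """Hypothetical stand-in: a dict takes the place of the ART."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.static_score = {}            # doc id -> user-specified static score

    def add(self, doc_id, text, score=0):
        self.static_score[doc_id] = score
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query, max_typos=1):
        matched = defaultdict(int)        # doc id -> number of query tokens matched
        for q in query.lower().split():
            hits = set()
            # Brute-force scan over every indexed token (assumption: a real
            # trie-based index would avoid visiting all tokens).
            for token, doc_ids in self.postings.items():
                if edit_distance(q, token) <= max_typos:
                    hits |= doc_ids
            for doc_id in hits:
                matched[doc_id] += 1
        # Rank by number of matching tokens first, then by static score.
        return sorted(matched, key=lambda d: (-matched[d], -self.static_score[d]))


index = TinyIndex()
index.add(1, "adaptive radix tree", score=10)
index.add(2, "radix sort", score=50)
index.add(3, "binary search tree", score=5)
results = index.search("radx tree")  # "radx" is a one-edit typo of "radix"
```

Here document 1 matches both query tokens and ranks first; documents 2 and 3 match one token each and are ordered by their static scores.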