Added design and todo docs.

2025-05-17 04:02:36 +08:00 · 2016-08-16 21:17:37 +05:30 · 2016-08-16 21:17:37 +05:30 · 7147fa7ed5
commit 7147fa7ed5
parent 0eeb75b385
3 changed files with 74 additions and 2 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -0,0 +1,46 @@
+# Typesense: Design
+
+## Motivation
+
+Typesense's design is motivated by the following considerations:
+
+- **Simplicity:** Typesense has to be super simple to set up and get started with. The default configuration
+*should just work* for the common search use cases.
+- **Typo-tolerance out-of-the-box:** Currently, it is not at all easy to build typo-tolerant search on existing
+systems without a considerable speed/memory penalty. This has to change, given how common typographic errors are
+in the real world.
+- **In-memory:** All primary data structures would be in-memory, with the disk being used only for durability and for 
+fetching large, unindexed fields.
+- **Optimize for reads over writes:** A typical search index is written once and read many times. The system should be
+cognizant of this read/write pattern.
+- **Fast, without sacrificing relevance:** While speed is important, one cannot compromise on the quality of results
+returned. Remember that the average human reaction time to a visual stimulus is about 200ms.
+- **Laser focused on search:** While there might be some overlap with what a relational database does, strive to focus 
+primarily on search, instead of becoming a generalized data store with a complex query language.
+- **Availability over consistency:** In the event of a network partition, it's better to give slightly stale search
+results than to be unavailable. This is alright, given the inherent asynchronous nature of the indexing process.
+
+## Overview
+
+- At the heart of Typesense is a `token => documents` inverted index backed by an 
+[Adaptive Radix Tree](https://db.in.tum.de/~leis/papers/ART.pdf), which is a memory-efficient implementation of the 
+Trie data structure. ART allows us to do fast fuzzy searches on a query.
+- Typesense consumes JSON documents as input. Fields that should be indexed must be specified via a configuration file 
+  or through the API.
+- The raw JSON documents are stored on disk using RocksDB. SSD disks are highly recommended.
+- Search results are ranked on the following factors:
+    - Number of matching tokens
+    - Proximity of search tokens within the documents that contain these tokens
+    - User-specified static score for a document (e.g. the number of followers could be a static score for a
+      Twitter user)
+- A typical search query involves:
+    - a search term (required - wildcard `*` search is not allowed)
+    - filter fields (optional)
+    - facet fields (optional)
+    - sort fields (optional)
+    - page
+    - limit
+- Typesense is exposed through a RESTful API, so that it can be consumed directly by web apps via AJAX requests.
+- High availability is achieved using master-master replication. Every write to Typesense would be written and
+  acknowledged by another node before the write is deemed successful. Clients can round-robin both reads and
+  writes across both nodes.
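
The `token => documents` index and the ranking factors listed in the DESIGN.md overview can be illustrated with a small sketch. This is not Typesense's implementation: a plain dictionary with a brute-force edit-distance scan stands in for the Adaptive Radix Tree, and the names `TinyIndex`, `edit_distance`, `max_typos` and the proximity measure are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


class TinyIndex:
    """Hypothetical toy index: token -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.static_score = {}

    def add(self, doc_id, text, score=0):
        self.static_score[doc_id] = score
        for pos, token in enumerate(text.lower().split()):
            self.postings[token].setdefault(doc_id, []).append(pos)

    def search(self, query, max_typos=1):
        # Assumption: scan every indexed token; a real trie would prune this walk.
        hits = defaultdict(dict)  # doc_id -> {query token index: [positions]}
        for qi, qtok in enumerate(query.lower().split()):
            for token, docs in self.postings.items():
                if edit_distance(qtok, token) <= max_typos:
                    for doc_id, positions in docs.items():
                        hits[doc_id].setdefault(qi, []).extend(positions)

        def rank_key(doc_id):
            matched = hits[doc_id]
            firsts = [min(p) for p in matched.values()]
            # Crude proximity proxy: spread of each matched token's first occurrence.
            span = max(firsts) - min(firsts)
            # More matched tokens first, then tighter span, then higher static score.
            return (-len(matched), span, -self.static_score[doc_id])

        return sorted(hits, key=rank_key)


index = TinyIndex()
index.add(1, "the quick brown fox", score=10)
index.add(2, "quick thinking saves a brown bear", score=50)
index.add(3, "slow green turtle", score=99)
results = index.search("quck brown")  # "quck" is one edit away from "quick"
```

Here both documents 1 and 2 match two query tokens, so the tie is broken by proximity: document 1 has the matched tokens adjacent and ranks first, even though document 2 has the higher static score.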
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Typesense

-Typesense is an open source search engine for building delightful search experiences.
+Typesense is an open source search engine for building a delightful search experience.

 - **Typo tolerance:** Handles typographical errors out-of-the-box
 - **Tunable ranking + relevancy:** Tailor your search results to perfection
@@ -16,7 +16,6 @@ TODO
 * [libfor](https://github.com/cruppstahl/for/)
 * [h2o](https://github.com/h2o/h2o)
 * OpenSSL
-* Boost

 ## Building `libfor`

--- a/TODO.md
+++ b/TODO.md
@@ -0,0 +1,27 @@
+# Typesense: TODO
+
+## Pre-alpha
+
+**Search index**
+
+- Proper JSON as input
+- Storing raw JSON input to RocksDB
+- ART for every indexed field
+- UTF-8 support for fuzzy search
+- Facets
+- Filters
+- Support search operators like `+`, `-`, etc.
+
+**API**
+
+- Support the following operations:
+    - create a new index
+    - index a single document
+    - bulk insert multiple documents
+    - fetch a document by ID
+    - delete a document by ID
+    - query an index
+
+**Clustering**
+
+- Sync every incoming write with another Typesense server
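
The clustering item above, together with the replication note in DESIGN.md, amounts to a write that only counts as successful once a second server has acknowledged it. A minimal sketch of that rule follows, assuming two in-memory nodes; `Node`, `apply`, `write`, `pair` and the `reachable` flag are hypothetical names, and replicating to the peer before applying locally is a simplification chosen here to avoid rollback logic, not something the commit specifies.

```python
class Node:
    """Hypothetical in-memory stand-in for one Typesense server."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.peer = None
        self.reachable = True  # toggled to simulate a network partition

    def apply(self, doc_id, doc):
        """Store a replicated write and return an acknowledgement."""
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.docs[doc_id] = doc
        return True

    def write(self, doc_id, doc):
        """Accept a client write; succeed only once the peer has acknowledged it."""
        try:
            acked = self.peer.apply(doc_id, doc)
        except ConnectionError:
            return False  # no acknowledgement, so the write is not a success
        if acked:
            self.docs[doc_id] = doc
        return acked


def pair(a, b):
    """Wire two nodes together so each replicates to the other."""
    a.peer, b.peer = b, a


node_a, node_b = Node("a"), Node("b")
pair(node_a, node_b)

ok_first = node_a.write("1", {"title": "hello"})   # client writes to node a
ok_second = node_b.write("2", {"title": "world"})  # client writes to node b

node_b.reachable = False                           # simulate a partition
ok_third = node_a.write("3", {"title": "lost"})
```

Because either node accepts writes and forwards them to its peer, clients can alternate between the two. During the simulated partition the write is rejected, while documents already stored on `node_a` remain readable.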