# TLog Spill-By-Reference Design
## Background
(This assumes a basic familiarity with [FoundationDB's architecture](https://www.youtu.be/EMwhsGsxfPU).)
Transaction logs are a distributed Write-Ahead-Log for FoundationDB. They
receive commits from commit proxies, and are responsible for durably storing
those commits, and making them available to storage servers for reading.
Clients send *mutations*, the list of their sets, clears, atomic operations,
etc., to commit proxies. Commit proxies collect mutations into a *batch*, which
is the list of all changes that need to be applied to the database to bring it
from version `N-1` to `N`. Commit proxies then walk through their in-memory
mapping of shard boundaries to associate one or more *tags*, a small integer
uniquely identifying a destination storage server, with each mutation. They
then send a *commit*, the full list of `(tags, mutation)` for each mutation in
a batch, to the transaction logs.
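The tagging step described above can be pictured as a lookup in a sorted list of shard boundaries. The following is a minimal sketch, not FoundationDB's actual data structures: `SHARD_BEGINS`, `SHARD_TAGS`, `tags_for_key`, and `build_commit` are hypothetical names, the tag values are made up, and only single-key mutations are handled (a range clear spanning several shards is not covered).

```python
from bisect import bisect_right

# Hypothetical shard map: shard i covers keys in [SHARD_BEGINS[i], SHARD_BEGINS[i + 1]).
SHARD_BEGINS = [b"", b"g", b"p"]
# Tags (storage servers) holding a replica of each shard; the values are made up.
SHARD_TAGS = [[1, 2], [2, 3], [1, 3]]

def tags_for_key(key: bytes) -> list[int]:
    """Return the tags of the shard whose key range contains `key`."""
    shard = bisect_right(SHARD_BEGINS, key) - 1
    return SHARD_TAGS[shard]

def build_commit(batch: list[tuple[str, bytes, bytes]]) -> list[tuple[list[int], tuple]]:
    """Pair every single-key mutation (op, key, value) in a batch with its tags."""
    return [(tags_for_key(mutation[1]), mutation) for mutation in batch]

commit = build_commit([("set", b"apple", b"1"), ("set", b"kiwi", b"2")])
```

Each element of `commit` is one `(tags, mutation)` pair, which is the shape of the message sent to the transaction logs.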
The transaction log has two responsibilities: it must persist the commits to
disk and notify the proxy when a commit is durably stored, and it must make the
commit available for consumption by the storage server. Each storage server
*peeks* its own tag, which requests all mutations from the transaction log with
the given tag at a given version or above. After a storage server durably
applies the mutations to disk, it *pops* the transaction logs with the same tag
and its new durable version, notifying the transaction logs that they may
discard mutations with the given tag and a lesser version.
To persist commits, a transaction log appends commits to a growable on-disk
ring buffer, called a *disk queue*, in version order. Commit data is *pushed*
onto the disk queue, and when all mutations in the oldest commit persisted are
no longer needed, the disk queue is *popped* to trim its tail.
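As an illustration of the push and pop operations, here is a small in-memory stand-in for the disk queue, addressed by byte offset. It is a sketch under simplifying assumptions: `TinyDiskQueue` is a hypothetical name, nothing is written to disk, and the two-file ring buffer layout is ignored.

```python
from collections import deque

class TinyDiskQueue:
    """In-memory stand-in for the disk queue, addressed by byte offset."""

    def __init__(self) -> None:
        self.commits = deque()  # (version, start_offset, payload), oldest first
        self.end_offset = 0     # offset at which the next push will land

    def push(self, version: int, payload: bytes) -> tuple[int, int]:
        """Append a commit and return its (start, end) offsets."""
        start = self.end_offset
        self.end_offset += len(payload)
        self.commits.append((version, start, payload))
        return start, self.end_offset

    def pop(self, oldest_needed_version: int) -> int:
        """Drop commits older than `oldest_needed_version`; return the first offset still held."""
        while self.commits and self.commits[0][0] < oldest_needed_version:
            self.commits.popleft()
        return self.commits[0][1] if self.commits else self.end_offset

dq = TinyDiskQueue()
dq.push(1, b"aaaa")
dq.push(2, b"bb")
dq.push(3, b"cccccc")
front = dq.pop(3)  # versions 1 and 2 are no longer needed
```

The `(start, end)` offsets returned by `push` are the kind of location that later sections refer to when they talk about an index into the disk queue.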
To make commits available to storage servers efficiently, a transaction log
maintains a copy of the commit in-memory, and maintains one queue per tag that
indexes the location of each mutation in each commit with the specific tag,
sequentially. This way, responding to a peek from a storage server only
requires sequentially walking through the queue, and copying each mutation
referenced into the response buffer.
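The peek and pop protocol over per-tag queues can be sketched in a few lines. This is a toy model, not the TLog's implementation: `TinyTLog` is a hypothetical name, and the queues hold the mutations themselves rather than indexing into a shared copy of the commit.

```python
from collections import defaultdict, deque

class TinyTLog:
    """Toy model of the in-memory per-tag queues."""

    def __init__(self) -> None:
        self.queues = defaultdict(deque)  # tag -> deque of (version, mutation)

    def commit(self, version, tagged_mutations) -> None:
        """Append each mutation to the queue of every tag it carries."""
        for tags, mutation in tagged_mutations:
            for tag in tags:
                self.queues[tag].append((version, mutation))

    def peek(self, tag, begin_version):
        """Return all mutations for `tag` at `begin_version` or above."""
        return [(v, m) for v, m in self.queues[tag] if v >= begin_version]

    def pop(self, tag, durable_version) -> None:
        """Discard mutations for `tag` with a version below `durable_version`."""
        queue = self.queues[tag]
        while queue and queue[0][0] < durable_version:
            queue.popleft()

tlog = TinyTLog()
tlog.commit(1, [([1, 2], "set a")])
tlog.commit(2, [([2], "clear b")])
```

A storage server owning tag 2 would call `tlog.peek(2, 1)` to read both mutations, and later `tlog.pop(2, 2)` once version 1 is durable on its own disk.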
Transaction logs internally handle commits by performing two operations
concurrently. First, they walk through each mutation in the commit, and push
the mutation onto an in-memory queue of mutations destined for that tag.
Second, they include the data in the next batch of pages to durably persist to
disk. These in-memory queues are popped from when the corresponding storage
server has persisted the data to its own disk. The disk queue exists only to
allow the in-memory queues to be rebuilt if the transaction log crashes. It is
never read except while a transaction log recovers after a crash, and it is
popped when the oldest version it contains is no longer needed in memory.
TLogs will need to hold the last 5-7 seconds of mutations. In normal
operation, the default 1.5GB of memory is enough such that the last 5-7 seconds
of commits should almost always fit in memory. However, in the presence of
failures, the transaction log can be required to buffer significantly more
data. Most notably, when a storage server fails, its tag isn't popped until
data distribution is able to re-replicate all of the shards that storage server
was responsible for to other storage servers. Before that happens, mutations
will accumulate on the TLog destined for the failed storage server, in case it
comes back and is able to rejoin the cluster.
When this accumulation causes the memory required to hold all the unpopped data
to exceed `TLOG_SPILL_THRESHOLD` bytes, the transaction log offloads the
oldest data to disk. This writing of data to disk to reduce TLog memory
pressure is referred to as *spilling*.
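The trigger for spilling amounts to a loop that evicts the oldest versions until memory use drops back under the threshold. The sketch below assumes a hypothetical `spill_oldest` helper and an unrealistically small threshold. The real TLog spills in batches, so this only illustrates the oldest-first ordering.

```python
from collections import OrderedDict

def spill_oldest(in_memory: "OrderedDict[int, bytes]", threshold: int) -> list[int]:
    """Evict the oldest versions until the bytes held in memory fit under `threshold`.

    `in_memory` maps version -> commit bytes, in version order.
    Returns the versions that were spilled.
    """
    spilled = []
    used = sum(len(data) for data in in_memory.values())
    while used > threshold and in_memory:
        version, data = in_memory.popitem(last=False)  # oldest version first
        used -= len(data)
        spilled.append(version)
    return spilled

in_memory = OrderedDict([(1, b"x" * 60), (2, b"y" * 60), (3, b"z" * 60)])
spilled = spill_oldest(in_memory, threshold=100)
```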
**************************************************************
*                      Transaction Log                       *
*                                                            *
*                                                            *
* +------------------+   pushes   +------------+             *
* | Incoming Commits |----------->| Disk Queue | +------+    *
* +------------------+            +------------+ |SQLite|    *
*          |                                ^    +------+    *
*          |                                |       ^        *
*          |                               pops     |        *
*          +------+-------+------+          |     writes     *
*                 |       |      |          |       |        *
*                 v       v      v        +----------+       *
* in-memory     +---+   +---+  +---+      |Spill Loop|       *
* queues        | 1 |   | 2 |  | 3 |      +----------+       *
* per-tag       |   |   |   |  |   |            ^            *
*               |...|   |...|  |...|            |            *
*                 |       |      |              |            *
*                 v       v      v              |            *
*                 +-------+------+--------------+            *
*                   queues spilled on overflow               *
*                                                            *
**************************************************************
## Overview
Previously, spilling would work by writing the data to a SQLite B-tree. The
key would be `(tag, version)`, and the value would be all the mutations
destined for the given tag at the given version. Peek requests carry a start
version, which is the latest version the storage server knows about,
and the TLog responds by range-reading the B-tree from the start version. Pop
requests allow the TLog to forget all mutations for a tag until a specific
version, and the TLog thus issues a range clear from `(tag, 0)` to
`(tag, pop_version)`. After spilling, the durably written data in the disk
queue would be trimmed to only include from the spilled version on, as any
required data is now entirely, durably held in the B-tree. As the entire value
is copied into the B-tree, this method of spilling will be referred to as
*spill-by-value* in the rest of this document.
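The spill-by-value key layout and its range operations can be modelled with a sorted key list. This is a sketch only: `ValueSpillStore` is a hypothetical stand-in for the SQLite B-tree, and it assumes tags are plain integers so that `(tag + 1, 0)` can serve as the end of a tag's key range.

```python
from bisect import bisect_left, insort

class ValueSpillStore:
    """Sorted-key stand-in for the B-tree used by spill-by-value."""

    def __init__(self) -> None:
        self.keys = []    # sorted list of (tag, version)
        self.values = {}  # (tag, version) -> list of mutations

    def spill(self, tag: int, version: int, mutations) -> None:
        """Write one key per (tag, version), holding the full mutations."""
        key = (tag, version)
        if key not in self.values:
            insort(self.keys, key)
        self.values[key] = list(mutations)

    def peek(self, tag: int, begin_version: int):
        """Range-read everything for `tag` from `begin_version` onward."""
        lo = bisect_left(self.keys, (tag, begin_version))
        hi = bisect_left(self.keys, (tag + 1, 0))
        return [(k[1], self.values[k]) for k in self.keys[lo:hi]]

    def pop(self, tag: int, pop_version: int) -> None:
        """Range-clear from (tag, 0) up to, but not including, (tag, pop_version)."""
        lo = bisect_left(self.keys, (tag, 0))
        hi = bisect_left(self.keys, (tag, pop_version))
        for key in self.keys[lo:hi]:
            del self.values[key]
        del self.keys[lo:hi]

store = ValueSpillStore()
store.spill(1, 5, ["a"])
store.spill(1, 7, ["b"])
store.spill(2, 6, ["c"])
```

Note that every `(tag, version)` pair costs one B-tree write here, which is the cost that the rest of this document sets out to reduce.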
Unfortunately, it turned out that spilling in this fashion greatly impacts TLog
performance. A write bandwidth saturation test was run against a cluster, with
a modification to the transaction logs to have them act as if there was one
storage server that was permanently failed; it never sent pop requests to allow
the TLog to remove data from memory. After 15min, the write bandwidth had
reduced to 30% of its baseline. After 30min, that became 10%. After 60min,
that became 5%. Writing entire values gives an immediate 3x additional write
amplification, and the actual write amplification increases as the B-tree gets
deeper. (This is an intentional illustration of the worst case, due to the
workload being a saturating write load.)
With the recent multi-DC/multi-region work, a failure of a remote data center
would cause transaction logs to need to buffer all commits, as every commit is
tagged as destined for the remote datacenter. This would rapidly push
transaction logs into a spilling regime, and thus write bandwidth would begin
to rapidly degrade. It is unacceptable for a remote datacenter failure to so
drastically affect the primary datacenter's performance, so a more performant
way of spilling data is required.
Whereas spill-by-value copies the entire mutation into the B-tree and removes
it from the disk queue, spill-by-reference leaves the mutations in the disk
queue and writes a pointer to them into the B-tree. Performance experiments
revealed that the TLog's performance while spilling was dictated more by the
number of writes done to the SQLite B-tree than by the size of those writes.
Thus, the ability of spill-by-reference to batch its B-tree writes far more
aggressively matters more than the fact that it writes less data in aggregate.
The two are linked: spill-by-reference significantly reduces the volume of data
written to the B-tree, and the less data we write, the more versions we can
batch into a single write.
************************************************************************
*                              DiskQueue                               *
*                                                                      *
*    ------- Index in B-tree -------     ---- Index in memory ----     *
*  /                                 \ /                           \   *
* +-----------------------------------+-----------------------------+  *
* |           Spilled Data            |      Most Recent Data       |  *
* +-----------------------------------+-----------------------------+  *
* lowest version                                      highest version  *
*                                                                      *
************************************************************************
Spill-by-reference works by taking a larger range of versions, and building a
single key-value pair per tag that describes where in the disk queue every
relevant commit for that tag is located. Concretely, this takes the form
`(tag, last_version) -> [(version, start, end, mutation_bytes), ...]`, where:
* `tag` is the small integer representing the storage server this mutation batch is destined for.
* `last_version` is the last/maximum version contained in the value's batch.
* `version` is the version of the commit that this index entry points to.
* `start` is an index into the disk queue of where to find the beginning of the commit.
* `end` is an index into the disk queue of where the end of the commit is.
* `mutation_bytes` is the number of bytes in the commit that are relevant for this tag.
Spilling then writes only once per spilled tag into the B-tree for each
iteration through spilling. This turns the number of writes into the B-tree
from `O(tags * versions)` to `O(tags)`.
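To make the index format concrete, the sketch below builds one `(tag, last_version)` key per tag from a handful of commits. `SpilledEntry` and `build_spill_batch` are hypothetical names, and the input shape (a per-commit map from tag to relevant bytes) is an assumption made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class SpilledEntry(NamedTuple):
    version: int         # version of the commit this entry points to
    start: int           # disk queue offset where the commit begins
    end: int             # disk queue offset where the commit ends
    mutation_bytes: int  # bytes in the commit relevant to the tag

def build_spill_batch(commits):
    """Build {(tag, last_version): [SpilledEntry, ...]} for one spilling iteration.

    `commits` is a version-ordered iterable of
    (version, start, end, {tag: mutation_bytes}).
    """
    per_tag = defaultdict(list)
    for version, start, end, bytes_by_tag in commits:
        for tag, nbytes in bytes_by_tag.items():
            per_tag[tag].append(SpilledEntry(version, start, end, nbytes))
    return {(tag, entries[-1].version): entries for tag, entries in per_tag.items()}

commits = [
    (10, 0, 100, {1: 40, 2: 60}),
    (11, 100, 250, {1: 150}),
    (12, 250, 300, {2: 50}),
]
batch = build_spill_batch(commits)
writes_by_reference = len(batch)                    # one B-tree write per tag
writes_by_value = sum(len(c[3]) for c in commits)   # one write per (tag, version)
```

For these three commits, spill-by-value would issue four B-tree writes while the batched index needs two, and the gap widens as more versions are folded into one iteration.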
Note that each tuple in the list represents a commit, and not a mutation. This
means that peeking spilled commits will involve reading all mutations that were
a part of the commit, and then filtering them to only the ones that have the
tag of interest. Alternatively, one could have each tuple represent a mutation
within a commit, to prevent over-reading when peeking. There exist
pathological workloads for each strategy. The purpose of this work is most
importantly to support spilling of log router tags. These exist on every
mutation, so that it will get copied to other datacenters. This is the exact
pathological workload for recording each mutation individually, because it only
increases the number of IO operations used to read the same amount of data.
For a wider set of workloads, there's room to establish a heuristic as to when
to record mutation(s) versus the entire commit, but performance testing hasn't
surfaced this as important enough to include in the initial version of this
work.
Peeking spilled data now works by issuing a range read to the B-tree from
`(tag, peek_begin)` to `(tag, infinity)`. This is why the key contains the
last version of the batch, rather than the beginning, so that a range read from
the peek request's version will always return all relevant batches. For each
batched tuple, if the version is greater than our peek request's version, then
we read the commit containing that mutation from disk, extract the relevant
mutations, and append them to our response. There is a target size of the
response, 150KB by default. As we iterate through the tuples, we sum
`mutation_bytes`, which already informs us how many bytes of relevant mutations
we'll get from a given commit. This allows us to make sure we won't waste disk
IOs on reads that will end up being discarded as unnecessary.
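The byte budgeting during a peek can be sketched as a single pass over the index entries. `plan_peek` is a hypothetical helper with several assumptions labeled in the code: the peek version is treated as inclusive, 150KB is taken as 150,000 bytes, and at least one commit is always returned even if it alone exceeds the target.

```python
def plan_peek(entries, peek_begin, target_bytes=150_000):
    """Choose which disk queue ranges to read for one peek reply.

    `entries` is a version-ordered list of (version, start, end, mutation_bytes).
    Returns (ranges_to_read, expected_reply_bytes).
    """
    to_read, expected = [], 0
    for version, start, end, mutation_bytes in entries:
        if version < peek_begin:
            continue  # assumption: the peek version itself is included
        # Assumption: always return at least one commit, then stop at the target.
        if to_read and expected + mutation_bytes > target_bytes:
            break
        to_read.append((version, start, end))
        expected += mutation_bytes
    return to_read, expected

entries = [(5, 0, 10, 100), (6, 10, 20, 100), (7, 20, 30, 100)]
ranges, expected = plan_peek(entries, peek_begin=6, target_bytes=150)
```

Because `mutation_bytes` is known before any disk read happens, the loop can stop before issuing a read whose result would not fit in the reply.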
Popping spilled data works similarly to before, but now requires recovering
information from disk. Previously, we would maintain a map from version to
location in the disk queue for every version we hadn't yet spilled. Once
spilling has copied the value into the B-tree, knowing where the commit was in
the disk queue is useless to us, and is removed. In spill-by-reference, that
information is still needed to know how to map "pop until version 7" to "pop
until byte 87" in the disk queue. Unfortunately, keeping this information in
memory would result in TLogs slowly consuming more and more
memory[^versionmap-memory] as more data is spilled. Instead, we issue a range
read of the B-tree from `(tag, pop_version)` to `(tag, infinity)` and look at
the first commit we find with a version greater than our own. We then use its
starting disk queue location as the limit of what we could pop the disk queue
until for this tag.
[^versionmap-memory]: Pessimistic assumptions would suggest that a TLog spilling 1TB of data would require ~50GB of memory to hold this map, which isn't acceptable.
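The translation from a pop version to a disk queue location can be sketched as a range read over the spilled index. `disk_queue_pop_limit` is a hypothetical helper, the index is modelled as a sorted Python list rather than a B-tree, and the strict `>` comparison follows the wording of the paragraph above.

```python
from bisect import bisect_left

def disk_queue_pop_limit(index, tag, pop_version):
    """Return the disk queue offset this tag still needs, or None if nothing remains.

    `index` is a list sorted by key, with items of the form
    ((tag, last_version), [(version, start, end, mutation_bytes), ...]).
    """
    keys = [key for key, _ in index]
    i = bisect_left(keys, (tag, pop_version))  # range read from (tag, pop_version)
    while i < len(index) and index[i][0][0] == tag:
        for version, start, _end, _mutation_bytes in index[i][1]:
            if version > pop_version:
                return start  # first commit still needed by this tag
        i += 1
    return None

index = [
    ((1, 11), [(10, 0, 100, 40), (11, 100, 250, 150)]),
    ((1, 20), [(15, 300, 400, 10), (20, 400, 500, 10)]),
    ((2, 12), [(12, 250, 300, 50)]),
]
limit = disk_queue_pop_limit(index, tag=1, pop_version=11)
```

The disk queue as a whole can only be popped up to the smallest such limit across all tags, since one file is shared by every tag.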
## Detailed Implementation
The rough outline of concrete changes proposed looks like:
1. Allow a new TLog and old TLog to co-exist and be configurable, upgradeable, and recoverable
1. Modify spilling in new TLogServer
1. Modify peeking in new TLogServer
1. Modify popping in new TLogServer
1. Spill txsTag specially
### Configuring and Upgrading
Modifying how transaction logs spill data is a change to the on-disk files of
transaction logs. The work for enabling safe upgrades and rollbacks of
persistent state changes to transaction logs was split off into a separate
design document: "Forward Compatibility for Transaction Logs".
That document describes a `log_version` configuration setting that controls the
availability of new transaction log features. A similar configuration setting,
`log_spill`, was created: at `log_version>=3`, one may run `fdbcli> configure
log_spill:=2` to enable spill-by-reference. Only FDB 6.1 or newer will be able
to recover transaction log files that were using spill-by-reference. FDB 6.2
will use spill-by-reference by default.
| FDB Version | Default | Configurable |
|-------------|---------|--------------|
| 6.0 | No | No |
| 6.1 | No | Yes |
| 6.2 | Yes | Yes |
If running FDB 6.1, the full command to enable spill-by-reference is
`fdbcli> configure log_version:=3 log_spill:=2`.
The TLog implementing spill-by-value was moved to `OldTLogServer_6_0.actor.cpp`
and namespaced similarly. `tLogFnForOptions` takes a `TLogOptions`, which is
the version and spillType, and returns the correct TLog implementation
according to those settings. We maintain a map of
`(TLogVersion, StoreType, TLogSpillType)` to TLog instance, so that only
one SharedTLog exists per configuration variant.
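The "one SharedTLog per configuration variant" rule is a get-or-create lookup keyed on the configuration tuple. The sketch below is illustrative only: `SharedTLogKey`, `tlog_impl_for_key`, and `SharedTLogRegistry` are hypothetical names, the string values stand in for the real enum types, and the dispatch rule (spill-by-value goes to the 6.0 implementation) is an assumption drawn from the paragraph above.

```python
from typing import Callable, Dict, NamedTuple

class SharedTLogKey(NamedTuple):
    version: int     # stands in for TLogVersion
    store_type: str  # stands in for StoreType
    spill_type: str  # stands in for TLogSpillType: "value" or "reference"

def tlog_impl_for_key(key: SharedTLogKey) -> str:
    """Assumed dispatch rule: spill-by-value is served by the 6.0 implementation."""
    return "OldTLogServer_6_0" if key.spill_type == "value" else "TLogServer"

class SharedTLogRegistry:
    """Get-or-create map so each configuration variant has a single instance."""

    def __init__(self, start: Callable[[SharedTLogKey], object]) -> None:
        self._start = start
        self._instances: Dict[SharedTLogKey, object] = {}

    def get(self, key: SharedTLogKey) -> object:
        if key not in self._instances:
            self._instances[key] = self._start(key)
        return self._instances[key]

started = []

def start_shared_tlog(key: SharedTLogKey) -> dict:
    started.append(key)
    return {"impl": tlog_impl_for_key(key), "key": key}

registry = SharedTLogRegistry(start_shared_tlog)
a = registry.get(SharedTLogKey(3, "ssd", "reference"))
b = registry.get(SharedTLogKey(3, "ssd", "reference"))
c = registry.get(SharedTLogKey(2, "ssd", "value"))
```

Here `a` and `b` are the same object, while `c` is a second instance backed by the older implementation.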
### Generations
As background, each time FoundationDB goes through a recovery, it will
recruit a new generation of transaction logs. This new generation of
transaction logs will often be recruited on the same worker that hosted the
previous generation's transaction log. The old generation of transaction logs
will only shut down once all the data that they have has been fully popped.
This means that there can be multiple instances of a transaction log in the
same process.
Naively, this would create resource issues. Each instance would think that it
is allowed its own 1.5GB buffer of in-memory mutations. Instead, internally to
the TLog implementation, the transaction log is split into two parts. A
`SharedTLog` is all the data that should be shared across multiple generations.
A TLog is all the data that is private to one generation. Most notably, the
1.5GB mutation buffer and the on-disk files are owned by the `SharedTLog`. The
index for the data added to that buffer is maintained within each TLog. In the
code, a SharedTLog is `struct TLogData`, and a TLog is `struct LogData`.
(I didn't choose these names.)
This background is required, because one needs to keep in mind that we might be
committing in one TLog instance, a different one might be spilling, and yet
another might be the one popping data.
*********************************************************
*                      SharedTLog                       *
*                                                       *
*   +--------+--------+--------+--------+--------+      *
*   | TLog 1 | TLog 2 | TLog 3 | TLog 4 | TLog 5 |      *
*   +--------+--------+--------+--------+--------+      *
*       ^ popping         ^spilling         ^committing *
*********************************************************
Conceptually, this is because each TLog owns a separate part of the same Disk
Queue file. The earliest TLog instance needs to be the one that controls when
the earliest part of the file can be discarded. We spill in version order, and
thus whatever TLog is responsible for the earliest unspilled version needs to
be the one doing the spilling. We always commit the newest version, so the
newest TLog must be the one writing to the disk queue and inserting new data
into the buffer of mutations.
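The division of work between generations can be written down as a small rule. This sketch uses hypothetical names (`Generation`, `assign_roles`) and a boolean in place of real version tracking. It assumes the list is ordered oldest first and contains only generations that have not yet been fully popped.

```python
from typing import NamedTuple, Optional

class Generation(NamedTuple):
    name: str
    has_unspilled_data: bool  # True if this generation still has data only in memory

def assign_roles(generations: list[Generation]) -> dict[str, Optional[str]]:
    """Decide which generation pops, spills, and commits."""
    # Spilling proceeds in version order, so the oldest generation with
    # unspilled data is the one that spills next.
    spilling = next((g.name for g in generations if g.has_unspilled_data), None)
    return {
        "popping": generations[0].name,      # the oldest generation trims the file
        "spilling": spilling,
        "committing": generations[-1].name,  # the newest generation takes commits
    }

generations = [
    Generation("TLog 1", False),
    Generation("TLog 2", False),
    Generation("TLog 3", True),
    Generation("TLog 4", True),
    Generation("TLog 5", True),
]
roles = assign_roles(generations)
```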
### Spilling
`updatePersistentData()` is the core of the spilling loop. It takes a new
persistent data version, writes the in-memory index for all commits less than
that version to disk, and then removes them from memory. By contract, once
spilling commits an updated persistentDataVersion to the B-tree, then those
bytes will not need to be recovered into memory after a crash, nor will the
in-memory bytes be needed to serve a peek response.
Our new method of spilling iterates through each tag, and builds up a
`vector
Increasing it could increase throughput in spilling regimes.
Decreasing it will decrease how sawtooth-like TLog memory usage is.
`UPDATE_STORAGE_BYTE_LIMIT`
: How many bytes of mutations should be spilled at once in a spill-by-value TLog.
This knob is pre-existing, and has only been "changed" to only apply to spill-by-value.
`TLOG_SPILL_REFERENCE_MAX_BATCHES_PER_PEEK`
: How many batches of spilled data indexes should be read from disk to serve one peek request.
Increasing it will potentially increase the throughput of peek requests.
Decreasing it will decrease the number of read IOs done per peek request.
`TLOG_SPILL_REFERENCE_MAX_BYTES_PER_BATCH`
: How many bytes a batch of spilled data indexes can be.
Increasing it will increase TLog throughput while spilling.
Decreasing it will decrease the latency and increase the throughput of peek requests.
`TLOG_SPILL_REFERENCE_MAX_PEEK_MEMORY_BYTES`
: How many bytes of memory can be allocated to hold the results of reads from disk to respond to peek requests.
Increasing it will increase the number of parallel peek requests a TLog can handle at once.
Decreasing it will reduce TLog memory usage.
If increased, `--max_memory` should be increased by the same amount.
`TLOG_DISK_QUEUE_EXTENSION_BYTES`
: When a DiskQueue needs to extend a file, by how many bytes should it extend the file.
Increasing it will reduce metadata operations done to the drive, and likely tail commit latency.
Decreasing it will reduce allocated but unused space in the DiskQueue files.
Note that this was previously hardcoded to 20MB, and is only being promoted to a knob.
`TLOG_DISK_QUEUE_SHRINK_BYTES`
: If a DiskQueue file has extra space left when switching to the other file, by how many bytes should it be shrunk.
Increasing this will cause disk space to be returned to the OS faster.
Decreasing this will decrease TLog tail latency due to filesystem metadata updates.
## Observability
With the new changes, we must ensure that sufficient information has been exposed such that:
1. If something goes wrong in production, we can understand what and why from trace logs.
2. We can understand if the TLog is performing suboptimally, and if so, which knob we should change and by how much.
The following metrics were added to `TLogMetrics`:
### Spilling
### Peeking
`PeekMemoryRequestsStalled`
: The number of peek requests that are blocked on acquiring memory for reads.
`PeekMemoryReserved`
: The amount of memory currently reserved for serving peek requests.
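One way to picture how these two metrics move is a byte budget with a FIFO of waiting requests. This is a sketch of the general idea, not the TLog's mechanism: `PeekMemoryBudget` is a hypothetical class, it is synchronous rather than actor-based, and the mapping of its fields onto `PeekMemoryReserved` and `PeekMemoryRequestsStalled` is an assumption.

```python
from collections import deque

class PeekMemoryBudget:
    """Byte budget for peek reads, with a FIFO of stalled requests."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.reserved_bytes = 0  # would correspond to PeekMemoryReserved
        self.waiting = deque()   # sizes of stalled requests, oldest first

    @property
    def requests_stalled(self) -> int:
        """Would correspond to PeekMemoryRequestsStalled."""
        return len(self.waiting)

    def reserve(self, nbytes: int) -> bool:
        """Reserve immediately if possible; otherwise queue the request and return False."""
        if not self.waiting and self.reserved_bytes + nbytes <= self.limit_bytes:
            self.reserved_bytes += nbytes
            return True
        self.waiting.append(nbytes)
        return False

    def release(self, nbytes: int) -> int:
        """Return memory, then admit stalled requests in order; returns how many were admitted."""
        self.reserved_bytes -= nbytes
        admitted = 0
        while self.waiting and self.reserved_bytes + self.waiting[0] <= self.limit_bytes:
            self.reserved_bytes += self.waiting.popleft()
            admitted += 1
        return admitted

budget = PeekMemoryBudget(limit_bytes=100)
first = budget.reserve(60)   # fits
second = budget.reserve(60)  # does not fit, so it stalls
stalled_before = budget.requests_stalled
admitted = budget.release(60)
```

In this model a persistently non-zero stalled count suggests the budget is too small for the peek load, which is the situation `TLOG_SPILL_REFERENCE_MAX_PEEK_MEMORY_BYTES` is meant to tune.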
### Popping
`QueuePoppedVersion`
: The oldest version that's still useful.
`MinPoppedTagLocality`
: The locality of the tag that's preventing the DiskQueue from being further popped.
`MinPoppedTagId`
: The id of the tag that's preventing the DiskQueue from being further popped.
## Monitoring and Alerting
To answer questions like:
1. What new graphs should exist?
2. What old graphs might exist that would no longer be meaningful?
3. What alerts might exist that need to be changed?
4. What alerts should be created?
The ones I'm aware of:
* Any current alerts on "Disk Queue files more than [constant size] GB" will need to be removed.
* Any alerting or monitoring of `log*.sqlite` as an indication of spilling will no longer be effective.
* A graph of `BytesInput - BytesPopped` will give an idea of the number of "active" bytes in the DiskQueue file.