Mirror of https://github.com/facebook/rocksdb.git (synced 2025-04-20 03:20:00 +08:00)
Summary:

# Problem

Once opened, an iterator preserves its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh`-ed, the iterator holds on to its `Init`-time `SuperVersion` throughout its lifetime. As time goes by, this can result in an artificially long holdup of obsolete memtables (_potentially_ referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved especially problematic in the case of _logical_ backups (outside of the RocksDB `BackupEngine`).

# Solution

Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in https://github.com/facebook/rocksdb/pull/10594, we're adding a new opt-in `ReadOptions` knob that (when enabled) instructs the iterator to automatically refresh itself to the latest superversion - all while retaining the originally assigned, explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To keep the performance overhead minimal, we're leveraging a relaxed atomic for superversion freshness lookups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13354

Test Plan:

**Correctness:** A new test demonstrates the auto refresh behavior in contrast to the legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.

**Stress testing:** We're adding a command line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can, with random feature on/off configuration in db_crashtest.py.

# Benchmarking

The goal of this benchmark is to validate that throughput did not regress substantially. The benchmark was run on an optimized build, 3-5 times for each respective category or until convergence. In addition, we configured an aggressive threshold of 1 second for new `SuperVersion` creation. Experiments were run 'in parallel' (at the same time) on separate DB instances within a single host to evenly spread the potential adverse impact of noisy-neighbor activity. Host specs [1].

**TL;DR:** Baseline and new solution are practically indistinguishable from a performance standpoint. The difference (positive or negative) in throughput relative to the baseline, if any, is no more than 1-2%.

**Snapshot initialization approach:** This feature is only effective on iterators with a well-defined `snapshot` passed via the `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive. Especially in the case of 'tiny scans' (100 rows) it contributes as much as 25-35 microseconds, which is ~20-30% of the average per-op latency, unintentionally masking a _potentially_ adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all future scans _before_ running the experiments en masse. This is also a valuable data point to keep in mind for any future discussion about taking implicit snapshots - now we know what the lower-bound cost could be.

## "DB in memory" benchmark

**DB Setup**

1. Allow a single memtable to grow large enough (~572MB) to fit all the rows. Upon shutdown, all the rows will be flushed to the WAL file (the inspected `000004.log` file is 541MB in size).

```
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000 -max_write_buffer_number=2 -compression_type=none
```

2. As part of the subsequent DB open, the WAL will be processed into one or more SST files during recovery. We're selecting a block cache (`cache_size` parameter in the `db_bench` script) large enough to hold the entire DB in order to test the "hot path" CPU overhead.

```
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=1 -async_io=1 -num=? -seek_nexts=? -cache_size=? -write_buffer_size=? -auto_refresh_iterator_with_snapshot={0|1}
```

|  | seek_nexts=100; num=2,000,000 | seek_nexts=20,000; num=50,000 | seek_nexts=400,000; num=2,000 |
| -- | -- | -- | -- |
| baseline | 36362 (± 300) ops/sec, 928.8 (± 23) MB/s, 99.11% block cache hit | 52.5 (± 0.5) ops/sec, 1402.05 (± 11.85) MB/s, 99.99% block cache hit | 156.2 (± 6.3) ms/op, 1330.45 (± 54) MB/s, 99.95% block cache hit |
| auto refresh | 35775.5 (± 537) ops/sec, 926.65 (± 13.75) MB/s, 99.11% block cache hit | 53.5 (± 0.5) ops/sec, 1367.9 (± 9.5) MB/s, 99.99% block cache hit | 162 (± 4.14) ms/op, 1281.35 (± 32.75) MB/s, 99.95% block cache hit |

_-cache_size=5000000000 -write_buffer_size=3200000000 -max_write_buffer_number=2_

|  | seek_nexts=3,500,000; num=100 |
| -- | -- |
| baseline | 1447.5 (± 34.5) ms/op, 1255.1 (± 30) MB/s, 98.98% block cache hit |
| auto refresh | 1473.5 (± 26.5) ms/op, 1232.6 (± 22.2) MB/s, 98.98% block cache hit |

_-cache_size=17680000000 -write_buffer_size=14500000000 -max_write_buffer_number=2_

|  | seek_nexts=17,500,000; num=10 |
| -- | -- |
| baseline | 9.11 (± 0.185) s/op, 997 (± 20) MB/s |
| auto refresh | 9.22 (± 0.1) s/op, 984 (± 11.4) MB/s |

[1]

### Specs

| Property | Value |
| -- | -- |
| RocksDB | version 10.0.0 |
| Date | Mon Feb 3 23:21:03 2025 |
| CPU | 32 * Intel Xeon Processor (Skylake) |
| CPUCache | 16384 KB |
| Keys | 16 bytes each (+ 0 bytes user-defined timestamp) |
| Values | 100 bytes each (50 bytes after compression) |
| Prefix | 0 bytes |
| RawSize | 5.5 MB (estimated) |
| FileSize | 3.1 MB (estimated) |
| Compression | Snappy |
| Compression sampling rate | 0 |
| Memtablerep | SkipListFactory |
| Perf Level | 1 |

Reviewed By: pdillinger

Differential Revision: D69122091

Pulled By: mszeszko-meta

fbshipit-source-id: 147ef7c4fe9507b6fb77f6de03415bf3bec337a8
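For illustration, here is a minimal sketch of the opt-in from the caller's perspective. The `ReadOptions` field name is an assumption that mirrors the `db_bench` flag above (`auto_refresh_iterator_with_snapshot`); `ScanWithAutoRefresh` is a hypothetical helper:

```
// Minimal usage sketch; assumes the ReadOptions field mirrors the db_bench
// flag name `auto_refresh_iterator_with_snapshot`.
#include <cassert>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

void ScanWithAutoRefresh(rocksdb::DB* db) {
  // The feature requires an explicit, well-defined snapshot in ReadOptions.
  const rocksdb::Snapshot* snapshot = db->GetSnapshot();
  rocksdb::ReadOptions read_options;
  read_options.snapshot = snapshot;
  read_options.auto_refresh_iterator_with_snapshot = true;  // opt-in knob

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Reads remain consistent with `snapshot` even if the iterator quietly
    // swaps in a newer superversion mid-scan, letting obsolete memtables go.
  }
  assert(it->status().ok());
  db->ReleaseSnapshot(snapshot);
}
```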
148 lines · 5.3 KiB · C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#include <stdint.h>

#include <string>

#include "db/db_impl/db_impl.h"
#include "db/db_iter.h"
#include "db/range_del_aggregator.h"
#include "memory/arena.h"
#include "options/cf_options.h"
#include "rocksdb/db.h"
#include "rocksdb/iterator.h"
#include "util/autovector.h"

namespace ROCKSDB_NAMESPACE {

class Arena;
class Version;

// A wrapper iterator which wraps DB Iterator and the arena, with which the DB
// iterator is supposed to be allocated. This class is used as an entry point
// of an iterator hierarchy whose memory can be allocated inline. In that way,
// accessing the iterator tree can be more cache friendly. It is also faster
// to allocate.
// When using the class's Iterator interface, the behavior is exactly
// the same as the inner DBIter.
class ArenaWrappedDBIter : public Iterator {
 public:
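  // db_iter_ is placement-allocated inside arena_, so only its destructor is
  // invoked here; the backing memory is released together with arena_.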
  ~ArenaWrappedDBIter() override {
    if (db_iter_ != nullptr) {
      db_iter_->~DBIter();
    } else {
      assert(false);
    }
  }

  // Get the arena to be used to allocate memory for DBIter to be wrapped,
  // as well as child iterators in it.
  virtual Arena* GetArena() { return &arena_; }

  const ReadOptions& GetReadOptions() { return read_options_; }

  // Set the internal iterator wrapped inside the DB Iterator. Usually it is
  // a merging iterator.
  virtual void SetIterUnderDBIter(InternalIterator* iter) {
    db_iter_->SetIter(iter);
  }

  void SetMemtableRangetombstoneIter(
      std::unique_ptr<TruncatedRangeDelIterator>* iter) {
    memtable_range_tombstone_iter_ = iter;
  }

  bool Valid() const override { return db_iter_->Valid(); }
  void SeekToFirst() override { db_iter_->SeekToFirst(); }
  void SeekToLast() override { db_iter_->SeekToLast(); }
  // 'target' does not contain timestamp, even if user timestamp feature is
  // enabled.
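  // When auto refresh is enabled, seeks double as refresh points:
  // MaybeAutoRefresh() may install a fresher superversion (while keeping the
  // iterator's original snapshot) before the seek reaches the inner DBIter.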
  void Seek(const Slice& target) override {
    MaybeAutoRefresh(true /* is_seek */, DBIter::kForward);
    db_iter_->Seek(target);
  }

  void SeekForPrev(const Slice& target) override {
    MaybeAutoRefresh(true /* is_seek */, DBIter::kReverse);
    db_iter_->SeekForPrev(target);
  }

  void Next() override {
    db_iter_->Next();
    MaybeAutoRefresh(false /* is_seek */, DBIter::kForward);
  }

  void Prev() override {
    db_iter_->Prev();
    MaybeAutoRefresh(false /* is_seek */, DBIter::kReverse);
  }

  Slice key() const override { return db_iter_->key(); }
  Slice value() const override { return db_iter_->value(); }
  const WideColumns& columns() const override { return db_iter_->columns(); }
  Status status() const override { return db_iter_->status(); }
  Slice timestamp() const override { return db_iter_->timestamp(); }
  bool IsBlob() const { return db_iter_->IsBlob(); }

  Status GetProperty(std::string prop_name, std::string* prop) override;
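  // Refresh() renews the iterator against the latest DB state, while
  // Refresh(const Snapshot*) (introduced in
  // https://github.com/facebook/rocksdb/pull/10594) moves to the latest
  // superversion yet keeps reads pinned at the supplied snapshot.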
  Status Refresh() override;
  Status Refresh(const Snapshot*) override;

  bool PrepareValue() override { return db_iter_->PrepareValue(); }

  void Init(Env* env, const ReadOptions& read_options,
            const ImmutableOptions& ioptions,
            const MutableCFOptions& mutable_cf_options, const Version* version,
            const SequenceNumber& sequence,
            uint64_t max_sequential_skip_in_iterations, uint64_t version_number,
            ReadCallback* read_callback, ColumnFamilyHandleImpl* cfh,
            bool expose_blob_index, bool allow_refresh);

  // Store some parameters so we can refresh the iterator at a later point
  // with these same params
  void StoreRefreshInfo(ColumnFamilyHandleImpl* cfh,
                        ReadCallback* read_callback, bool expose_blob_index) {
    cfh_ = cfh;
    read_callback_ = read_callback;
    expose_blob_index_ = expose_blob_index;
  }

 private:
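  // Intended flow (a sketch; the actual implementation lives in the .cc
  // file): MaybeAutoRefresh() compares the cached sv_number_ against the
  // column family's latest superversion number using a cheap relaxed atomic
  // load, and only when a newer superversion exists does it call DoRefresh()
  // with the originally supplied snapshot to preserve read consistency.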
  void DoRefresh(const Snapshot* snapshot, uint64_t sv_number);
  void MaybeAutoRefresh(bool is_seek, DBIter::Direction direction);

  DBIter* db_iter_ = nullptr;
  Arena arena_;
  uint64_t sv_number_;
  ColumnFamilyHandleImpl* cfh_ = nullptr;
  ReadOptions read_options_;
  ReadCallback* read_callback_;
  bool expose_blob_index_ = false;
  bool allow_refresh_ = true;
  // If this is nullptr, the mutable memtable did not contain any range
  // tombstones when it was added under this DBIter.
  std::unique_ptr<TruncatedRangeDelIterator>* memtable_range_tombstone_iter_ =
      nullptr;
};

// Generate the arena wrapped iterator class.
// `cfh` is used for renewal. If left null, renewal will not
// be supported.
ArenaWrappedDBIter* NewArenaWrappedDbIterator(
    Env* env, const ReadOptions& read_options, const ImmutableOptions& ioptions,
    const MutableCFOptions& mutable_cf_options, const Version* version,
    const SequenceNumber& sequence, uint64_t max_sequential_skip_in_iterations,
    uint64_t version_number, ReadCallback* read_callback,
    ColumnFamilyHandleImpl* cfh = nullptr, bool expose_blob_index = false,
    bool allow_refresh = true);
}  // namespace ROCKSDB_NAMESPACE