Jay Huh
cc487ba367 Fix Compaction Stats for Remote Compaction and Tiered Storage (#13464)
Summary:
## Background

Compaction statistics are collected at various levels across different classes and structs.

* `InternalStats::CompactionStats`: Per-level compaction stats within a job (may be collected at the subcompaction level and later aggregated to the compaction level)
* `InternalStats::CompactionStatsFull`: Contains two per-level compaction stats - `output_level_stats` for primary output level stats and `proximal_level_stats` for proximal level stats. Proximal level statistics are only relevant when using Tiered Storage with the per-key placement feature enabled.
* `InternalStats::CompactionOutputsStats`: Simplified version of `InternalStats::CompactionStats` containing only a subset of its fields
* `CompactionJobStats`: Job-level compaction stats (may be collected at the subcompaction level and later aggregated to the compaction level)

Please note that some fields in the job-level stats are not present in the per-level stats, so they do not map 1-to-1 today.

## Issues

* In non-remote compactions, proximal level compaction statistics were not being aggregated into job-level statistics. Job-level statistics were missing the proximal level stats for Tiered Storage compactions with the per-key-placement feature enabled.
* During remote compactions, proximal level compaction statistics were pre-aggregated into job-level statistics on the remote side. However, per-level compaction statistics were not part of the serialized compaction result, so the primary host lost that information and was unable to populate `per_key_placement_comp_stats_` and `internal_stats_.proximal_level_stats` properly during installation.
* `TieredCompactionTest` was only checking whether (expected stats > 0 && actual stats > 0) instead of comparing actual values

## Fixes

* Renamed `compaction_stats_` to `internal_stats_` for `InternalStats::CompactionStatsFull` in `CompactionJob` for better readability
* Removed the usage of `InternalStats::CompactionOutputsStats` and consolidated it into `InternalStats::CompactionStats`.
* Remote compactions now include the internal stats in the serialized `CompactionServiceResult`. `output_level_stats` and `proximal_level_stats` are later propagated into the sub-compaction output stats accordingly.
* `CompactionJob::UpdateCompactionJobStats()` now takes `CompactionStatsFull` and aggregates the `proximal_level_stats` as well (see the sketch after this list)
* `TieredCompactionTest` now does actual value comparisons for input/output file counts and record counts. A follow-up is needed to do the same for bytes read/written.
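To make the aggregation fix concrete, here is a minimal standalone sketch. The struct layouts below are simplified stand-ins modeled on the field names above, not the actual RocksDB internals, which carry many more fields:

```
#include <cstdint>

// Simplified stand-in for InternalStats::CompactionStats (per-level stats).
struct PerLevelStats {
  uint64_t num_output_records = 0;
  uint64_t bytes_written = 0;
};

// Simplified stand-in for InternalStats::CompactionStatsFull.
struct CompactionStatsFull {
  PerLevelStats output_level_stats;    // primary output level
  PerLevelStats proximal_level_stats;  // only populated with per-key placement
};

// Simplified stand-in for job-level CompactionJobStats.
struct JobStats {
  uint64_t num_output_records = 0;
  uint64_t total_output_bytes = 0;
};

// The fix: job-level aggregation takes the full per-level stats and folds
// in the proximal level as well, instead of only the primary output level.
void UpdateCompactionJobStats(const CompactionStatsFull& internal_stats,
                              JobStats* job_stats) {
  job_stats->num_output_records +=
      internal_stats.output_level_stats.num_output_records +
      internal_stats.proximal_level_stats.num_output_records;
  job_stats->total_output_bytes +=
      internal_stats.output_level_stats.bytes_written +
      internal_stats.proximal_level_stats.bytes_written;
}
```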

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13464

Test Plan:
Unit Tests updated to verify stats

```
./compaction_service_test
```
```
./tiered_compaction_test
```

Reviewed By: pdillinger

Differential Revision: D71220393

Pulled By: jaykorean

fbshipit-source-id: ad70bffd9614ced683f90c7570a17def9b5c8f3f
2025-03-18 16:28:18 -07:00
Yu Zhang
17ac19f2c4 Add a check during recovery for proper seqno advancement (#13465)
Summary:
This PR adds a check during recovery for an invariant of the sequence number: it should never be set backward. This is inspired by a recent SEV that was caused by a software bug. It is a relatively cheap and straightforward check that RocksDB can do to avoid silently opening the DB in a corrupted state.
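As a rough sketch of the invariant (the real check sits in RocksDB's recovery path; the function name here is illustrative only), the shape is simply:

```
#include <cstdint>

#include <rocksdb/status.h>

// Illustrative only: reject any attempt to move the last sequence number
// backward while replaying state during DB open.
rocksdb::Status CheckSeqnoNotSetBackward(uint64_t current_seqno,
                                         uint64_t new_seqno) {
  if (new_seqno < current_seqno) {
    // Better to fail recovery than to silently open a corrupted DB.
    return rocksdb::Status::Corruption(
        "sequence number set backward during recovery");
  }
  return rocksdb::Status::OK();
}
```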

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13465

Test Plan:
Existing tests should cover the case when the invariant is met

The corrupted state is manually tested using the aforementioned bug.

Reviewed By: hx235

Differential Revision: D71226513

Pulled By: jowlyzhang

fbshipit-source-id: cd8056fa6653d44ceeb9ba9b4693ab0660a53b4e
2025-03-17 12:49:10 -07:00
Hui Xiao
24952ff088 Expose number of L0 files in the CF right before the compaction starts in CompactionJobInfo (#13462)
Summary:
**Context/Summary:**
For users who are interested in knowing how efficiently their compactions reduce L0 files, or how badly a long-running compaction "locks up" L0 files, they now have a reference point, "L0 files in the CF pre compaction", for their input compaction files.
- Compared to the existing stats or exposing this in some other way, exposing this info in CompactionJobInfo allows users to compare it with other compaction data (e.g., compaction input count, compaction reason) within **one** compaction (i.e., at per-compaction granularity).
- If this number is high while a "short-running" compaction has few L0 input files, then that compaction may have room for improvement; similarly for long-running compactions. This PR adds a new field `CompactionJobInfo::num_l0_files_pre_compaction` for that (see the sketch after this list).
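For example, the new field could be consumed from an event listener like this (a sketch assuming a build with this PR; `compaction_reason` and `input_files` are existing `CompactionJobInfo` fields):

```
#include <iostream>

#include <rocksdb/listener.h>

// Compare the CF's L0 population right before the compaction started with
// how many input files this particular compaction actually consumed.
class L0EfficiencyListener : public rocksdb::EventListener {
 public:
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::cout << "reason=" << static_cast<int>(info.compaction_reason)
              << " l0_pre_compaction=" << info.num_l0_files_pre_compaction
              << " num_inputs=" << info.input_files.size() << std::endl;
  }
};
```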

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13462

Test Plan: - Piggyback on an existing test

Reviewed By: jaykorean

Differential Revision: D71124938

Pulled By: hx235

fbshipit-source-id: aa47c9c86c62d9425771b320f5636e50671fd289
2025-03-17 11:11:44 -07:00
Maciej Szeszko
6ac13a5f0a Expose WriteLifeTimeHint at the FileOptions level (#13461)
Summary:
The original implementation of NVMe write lifetime hints (https://github.com/facebook/rocksdb/pull/3095) assumed a flexible interface which decouples file creation from the explicit act of setting the write lifetime hint (see `PosixWritableFile` for more context). However, there are existing file system implementations (ex. Warm Storage) that require all the options (including file write lifetime hints) to be specified once, at the time of the actual `FSWritableFile` object instantiation. We're extending `FileOptions` with `Env::WriteLifeTimeHint` and patching existing callsites accordingly to enable one-shot metadata setup for those more constrained implementations.

NOTE: Today `CalculateSSTWriteHint` only sets the write lifetime hint for Level compactions. We'll fill that gap in follow-up PRs and add the calculation for Universal Compaction, which would unblock Zippy's use case.
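A sketch of the one-shot setup this enables (the `FileOptions::write_hint` field name is an assumption based on the PR description):

```
#include <memory>
#include <string>

#include <rocksdb/file_system.h>

rocksdb::IOStatus OpenWithHint(
    rocksdb::FileSystem* fs, const std::string& fname,
    std::unique_ptr<rocksdb::FSWritableFile>* file) {
  rocksdb::FileOptions opts;
  // Declare the expected write lifetime up front, at FSWritableFile
  // instantiation time, instead of calling SetWriteLifeTimeHint() on the
  // file after it has been created.
  opts.write_hint = rocksdb::Env::WLTH_LONG;  // assumed field name
  return fs->NewWritableFile(fname, opts, file, /*dbg=*/nullptr);
}
```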

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13461

Reviewed By: anand1976

Differential Revision: D71144645

Pulled By: mszeszko-meta

fbshipit-source-id: 6c09b62a360d48bd6e4fb08a1265bce2a49f3f4a
2025-03-14 21:43:50 -07:00
Peter Dillinger
0cc943c067 format_version < 2 unsupported for write, deprecated for read (#13463)
Summary:
In hopes of eventually removing some ugly and awkward code for compress_format_version < 2, users can no longer write files in that format, and its read support is marked deprecated. For continuing to test that read support, there is a back door for writing such files in unit tests.

If format_version < 2 is specified, it is quietly sanitized to 2. (This is similar to other BlockBasedTableOptions.)
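For example, a configuration like the following now quietly writes format_version 2 files (a sketch using the standard BlockBasedTableOptions API):

```
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  bbto.format_version = 1;  // < 2: quietly sanitized to 2 after this change
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}
```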

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13463

Test Plan: unit tests updated.

Reviewed By: hx235

Differential Revision: D71152916

Pulled By: pdillinger

fbshipit-source-id: 95be55e86f93f09fd898223578b9381385c3ccd8
2025-03-14 10:50:05 -07:00
Jay Huh
ca7367a003 Replace penultimate naming with proximal (#13460)
Summary:
With generalized age-based tiering (work-in-progress), the "warm tier" data will no longer necessarily be placed in the second-to-last level (also known as the "penultimate level").

Also, the cold tier may no longer necessarily be at the last level, so we need to rename options like `preclude_last_level_seconds` to `preclude_cold_tier_seconds`, but renaming options is trickier because it can be a breaking change for consuming applications. We will do this later as a follow-up.

**Minor fix included**: Fixed one `use-after-move` in CompactionPicker

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13460

Test Plan: CI

Reviewed By: pdillinger

Differential Revision: D71059486

Pulled By: jaykorean

fbshipit-source-id: fd360cdf719e015bf9f9e3f6f1663438226566a4
2025-03-12 18:24:28 -07:00
Jay Huh
c5921df3d7 Add PerKeyPlacement support (#13459)
Summary:
This PR adds support for PerKeyPlacement in Remote Compaction.

The `seqno_to_time_mapping` is already available from the table properties of the input files. `preserve_internal_time_seconds` and `preclude_last_level_data_seconds` are directly read from the OPTIONS file upon db open in the remote worker. The necessary changes include:

- Add `is_penultimate_level_output` and `file_temperature` to the `CompactionServiceOutputFile`
- When building the output for the remote compaction, get the outputs for the penultimate level and last level separately, and serialize them with the two additional fields added in this PR.
- When deserializing the result from the primary, SubcompactionState's `GetOutputs()` now takes `is_penultimate_level`. This allows us to determine in which level to place the output file.
- Include stats from `compaction_stats.penultimate_level_stats` in the remote compaction result

# To Follow up
- Stats to be fixed. Stats are not being populated correctly for PerKeyPlacement even for non-remote compactions.
- Clean up / Reconcile the "penultimate" naming by replacing with "proximal"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13459

Test Plan:
Updated the unit test

```
./compaction_service_test
```

Reviewed By: pdillinger

Differential Revision: D71007211

Pulled By: jaykorean

fbshipit-source-id: f926e56df17239875d849d46b8b940f8cd5f1825
2025-03-12 11:46:02 -07:00
Maciej Szeszko
8e16f8fecf Reduce db stress noise (#13447)
Summary:
[Experiment]

This PR is a followup to https://github.com/facebook/rocksdb/pull/13408. The thick bandaid of ignoring all injected read errors in the context of periodic iterator auto refreshes in db stress proved to be effective. We confirmed our theory that the errors are not really a consequence / defect of this new feature but rather due to subtle ways in which downstream code paths handle their respective IO failures. In this change we're replacing the thick 'ignore all IO read errors' bandaid in `no_batched_ops_stress` with much smaller, targeted patches in the obsolete files purge / delete codepaths, the table block cache reader, and the table cache lookup, to make sure we don't miss signal and to ensure there's a single mechanism for ignoring error injection in db stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13447

Reviewed By: hx235

Differential Revision: D70794787

Pulled By: mszeszko-meta

fbshipit-source-id: c5fcd4780d82357c407f53bf0bb22fc38f7bd277
2025-03-12 01:13:40 -07:00
Jay Huh
22ca6e5e68 Additional debug logging for InputFileCheck Failure (#13452)
Summary:
Add debug logging when the Wait() does not return `kSuccess` so that we can compare the version state that was printed by the logging added in https://github.com/facebook/rocksdb/issues/13427 upon InputFileCheck failure.

# Test Plan

CI + Tested with Temporary Change in Meta Internal Infra

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13452

Reviewed By: hx235

Differential Revision: D70898963

Pulled By: jaykorean

fbshipit-source-id: d591b82f2df173b5e01f6552230844ce95155256
2025-03-10 13:37:47 -07:00
Richard Barnes
60c266658d Use nullptr in infra_asic_fpga/ip/mtia/athena/main/models/cmodel/util/jsonUtils.cpp
Summary:
`nullptr` is preferable to `0` or `NULL`. Let's use it everywhere so we can enable `-Wzero-as-null-pointer-constant`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Reviewed By: dtolnay

Differential Revision: D70818166

fbshipit-source-id: 4658fb004676fe2686249fdd8ecb322dec8aa63d
2025-03-09 11:18:56 -07:00
Peter Dillinger
b9c7481fc2 Fix some secondary/read-only DB logic (#13441)
Summary:
Primarily, fix an issue from https://github.com/facebook/rocksdb/issues/13316 with opening secondary DB with preserve/preclude option (crash test disable in https://github.com/facebook/rocksdb/issues/13439). The issue comes down to mixed-up interpretations of "read_only" which should now be resolved. I've introduced the stronger notion of "unchanging" which means the VersionSet never sees any changes to the LSM tree, and the weaker notion of "read_only" which means LSM tree changes are not written through this VersionSet/etc. but can pick up externally written changes. In particular, ManifestTailer should use read_only=true (along with unchanging=false) for proper handling of preserve/preclude options.

A new assertion in VersionSet::CreateColumnFamily to help ensure sane usage of the two boolean flags is incompatible with the known wart of allowing CreateColumnFamily on a read-only DB. So to keep that assertion, I have fixed that issue by disallowing it. And this in turn required downstream clean-up in ldb, where I cleaned up some call sites as well.

Also, rename SanitizeOptions for ColumnFamilyOptions to SanitizeCfOptions, for ease of search etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13441

Test Plan:
* Added preserve option to a test in db_secondary_test, which reproduced the failure seen in the crash test.
* Revert https://github.com/facebook/rocksdb/issues/13439 to re-enable crash test functionality
* Update some tests to deal with disallowing CF creation on read-only DB
* Add some testing around read-only DBs and CreateColumnFamily(ies)
* Resurrect a nearby test for read-only DB to be sure it doesn't write to the DB dir. New EnforcedReadOnlyReopen should probably be used in more places but didn't want to attempt a big migration here and now. (Suggested follow-up.)

Reviewed By: jowlyzhang

Differential Revision: D70808033

Pulled By: pdillinger

fbshipit-source-id: 486b4e9f9c9045150a0ebb9cb302753d03932a3f
2025-03-07 14:56:45 -08:00
Peter Dillinger
5d1c0a8832 Reformat assertion in TEST_VerifyNoObsoleteFilesCached (#13446)
Summary:
... for better automatic failure grouping

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13446

Test Plan: no production code change

Reviewed By: hx235

Differential Revision: D70789464

Pulled By: pdillinger

fbshipit-source-id: 68263f6ed666349d65b5f493865973a213f35ec9
2025-03-07 11:25:44 -08:00
Jay Huh
d033c6a849 set ignore_unknown_options when parsing options (#13443)
Summary:
In case the primary host has a new option added which isn't available in the remote worker yet, the remote compaction currently fails. In most cases, these new options are not relevant to the remote compaction and the worker should be able to move on by ignoring it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13443

Test Plan: Verified internally in Meta Infra.

Reviewed By: anand1976

Differential Revision: D70744359

Pulled By: jaykorean

fbshipit-source-id: eb6a388c2358a7f8089f2e35a378b7017b9e03f3
2025-03-06 17:26:37 -08:00
Jay Huh
68b2d941be Introduce kAborted Status (#13438)
Summary:
If a compaction job needs to be aborted inside `Schedule()` or `Wait()` today (e.g. the primary host is shutting down), the only two options are the following:
- Handle it as failure by returning `CompactionServiceJobStatus::kFailure`
- Return `CompactionServiceJobStatus::kUseLocal` and let the compaction move on locally and eventually succeed or fail depending on the timing

In this PR, we are introducing a new status, `CompactionServiceJobStatus::kAborted`, so that the implementation of `Schedule()` and `Wait()` can return it. Just like when `CompactionServiceJobStatus::kFailure` is handled, the compaction will not move on and will fail, but the status will be returned as `Status::Aborted()` instead of `Status::Incomplete()`.
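A sketch of how an implementation might surface the new status (only the `Wait()` override is shown; other required overrides and the actual shutdown signaling are omitted, and the exact `Wait()` signature is an assumption based on the interface referenced above):

```
#include <atomic>
#include <string>

#include <rocksdb/options.h>

class MyCompactionService : public rocksdb::CompactionService {
 public:
  const char* Name() const override { return "MyCompactionService"; }

  rocksdb::CompactionServiceJobStatus Wait(const std::string& scheduled_job_id,
                                           std::string* result) override {
    if (shutting_down_.load(std::memory_order_acquire)) {
      // New in this PR: the caller stops the compaction and reports
      // Status::Aborted() instead of treating this as a plain failure.
      return rocksdb::CompactionServiceJobStatus::kAborted;
    }
    // ... otherwise poll the remote worker for `scheduled_job_id` and
    // fill `*result` with the serialized CompactionServiceResult ...
    return rocksdb::CompactionServiceJobStatus::kSuccess;
  }

 private:
  std::atomic<bool> shutting_down_{false};
};
```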

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13438

Test Plan:
Unit Test added
```
 ./compaction_service_test --gtest_filter="*CompactionServiceTest.AbortedWhileWait*"
```

Reviewed By: anand1976, hx235

Differential Revision: D70655355

Pulled By: jaykorean

fbshipit-source-id: 22614ce9c7455cda649b15465625edc93978fe11
v9.11.1
2025-03-05 22:15:17 -08:00
Andrew Chang
8e6d431153 Add IOActivityToString helper method (#13440)
Summary:
I have a place I want to use this helper method inside the Sally codebase. I have this functionality in my Sally diff right now, but I think it is generic enough to warrant putting alongside `Env::PriorityToString`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13440

Test Plan: Just the compiler and CI checks are sufficient IMO.

Reviewed By: hx235

Differential Revision: D70664597

Pulled By: archang19

fbshipit-source-id: 341de6c6e311a3f421ad093c2c216e5caa5034dd
2025-03-05 19:07:01 -08:00
anand76
14c949df8b Initial implementation of ExternalTableBuilder (#13434)
Summary:
This PR adds the ability to use an ExternalTableBuilder through the SstFileWriter to create external tables. This is a counterpart to https://github.com/facebook/rocksdb/issues/13401 , which adds the ExternalTableReader. The support for external tables is confined to ingestion-only DBs, with external table files ingested into the bottommost level only. https://github.com/facebook/rocksdb/issues/13431 enforces ingestion-only DBs by adding a disallow_memtable_writes column family option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13434

Test Plan: New unit tests in table_test.cc

Reviewed By: pdillinger

Differential Revision: D70532054

Pulled By: anand1976

fbshipit-source-id: a837487eadfabed9627a0eceb403bfc5fc2c427c
2025-03-05 16:30:46 -08:00
anand76
f6bff87b92 Add opaque options in ReadOptions for external tables (#13436)
Summary:
Add an unordered_map of name/value pairs in ReadOptions::property_bag, similar to IOOptions::property_bag. It allows users to pass through some custom options to an external table.
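A sketch of how a caller might use it (the key names are hypothetical; an external table implementation defines which keys it understands and ignores the rest):

```
#include <string>
#include <unordered_map>

#include <rocksdb/options.h>

rocksdb::ReadOptions MakeExternalTableReadOptions() {
  rocksdb::ReadOptions ro;
  // Opaque name/value pairs passed through to the external table reader;
  // these example keys are hypothetical.
  ro.property_bag = std::unordered_map<std::string, std::string>{
      {"my_table.prefetch", "aggressive"},
      {"my_table.verify_checksums", "false"}};
  return ro;
}
```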

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13436

Reviewed By: jaykorean

Differential Revision: D70649609

Pulled By: anand1976

fbshipit-source-id: 9b14806a9f3599b861827bd4ae6e948861edc51a
2025-03-05 16:25:41 -08:00
Peter Dillinger
ec8f1452f5 Temp disable in crash test: secondary instance + seqno-time tracking (#13439)
Summary:
PR https://github.com/facebook/rocksdb/issues/13316 broke some crash test cases in DBImplSecondary, from combining test_secondary=1 and preserve_internal_time_seconds>0. Disabling that while investigating the fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13439

Test Plan: manual blackbox_crash_test runs with forced test_secondary=1

Reviewed By: anand1976

Differential Revision: D70656373

Pulled By: pdillinger

fbshipit-source-id: fa2139e90bbe64ec8ebb062877d9337894ea3b43
2025-03-05 14:32:05 -08:00
Peter Dillinger
15873b1fdd New CF option disallow_memtable_writes (#13431)
Summary:
... to better support "ingestion only" column families such as those using an external file reader as in https://github.com/facebook/rocksdb/issues/13401.

It would be possible to implement this by getting rid of the memtable for that CF, but it quickly became clear that such an approach would need to update a lot of places to deal with such a possibility. And we already have logic to optimize reads when a memtable is empty. We put a vector memtable in place to minimize the overheads of an empty memtable.

There are three layers of defense against writes to the memtable (a usage sketch follows this list):
* WriteBatch ops to a disallowed CF will fail immediately, without waiting for Write(). For this check to work, we need a ColumnFamilyHandle and because of that, we don't support disallow_memtable_writes on the default column family.
* MemtableInserter will reject writes to disallowed CFs. This is needed to protect re-open with disallow when there are existing writes in a WAL.
* The placeholder memtable is marked immutable. This will cause an assertion failure on attempt to write, such as in case of bug or regression.
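A minimal usage sketch of the first layer (assuming a build with this PR; the write fails fast at WriteBatch time, before `Write()` is reached):

```
#include <cassert>

#include <rocksdb/db.h>

void IngestOnlyCfExample(rocksdb::DB* db) {
  rocksdb::ColumnFamilyOptions cf_opts;
  cf_opts.disallow_memtable_writes = true;  // new option from this PR

  rocksdb::ColumnFamilyHandle* cf = nullptr;
  rocksdb::Status s = db->CreateColumnFamily(cf_opts, "ingest_only", &cf);
  assert(s.ok());

  // Rejected immediately by the WriteBatch layer, not by Write().
  s = db->Put(rocksdb::WriteOptions(), cf, "key", "value");
  assert(!s.ok());

  s = db->DestroyColumnFamilyHandle(cf);
  assert(s.ok());
}
```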

Suggested follow-up:
* Remove the limitation on using the option with the default column family, perhaps by solving https://github.com/facebook/rocksdb/issues/13429 more generally or perhaps with some specific check before the first memtable write of the batch (but there is potential CPU overhead for such a check - there are likely optimization opportunities around ColumnFamilyMemTables).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13431

Test Plan:
unit tests added

Performance: A db_bench call designed to realistically focus on the CPU cost of writes:

```
./db_bench -db=/dev/shm/dbbench1 --benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -num_column_families=20 -disable_wal -write_buffer_size=1234000
```

Running before & after tests at the same time on the same machine, 40 iterations each, average ops/s, DEBUG_LEVEL=0, remove slowest run of each:
Before: 772466
After: 773785 (0.2% faster)

Likely within the noise, as if there was any change, we would expect a slight regression.

Reviewed By: anand1976

Differential Revision: D70495936

Pulled By: pdillinger

fbshipit-source-id: 306f7e737f87c1fbb52c5805f3cadb6e8ced9b40
2025-03-04 18:33:52 -08:00
Peter Dillinger
da8eba8b49 Improve consistency of SeqnoToTime tracking in SuperVersion (#13316)
Summary:
This is an unexpectedly complex follow-up to https://github.com/facebook/rocksdb/issues/13269.

This change solves (and detects regressions of) inconsistencies between whether a CF's SuperVersion is configured with a preserve/preclude option and whether it gets a usable SeqnoToTimeMapping. Operating with preserve/preclude and no usable mapping is degraded functionality we need to avoid. And no mapping is useful for actually disabling the feature (except with respect to existing SST files, but that's less of a concern for now).

The challenge is that how we maintain the DB's SeqnoToTimeMapping can depend on all the column families, and we don't want to iterate over all column families *for each column family* (e.g. on initially creating each). The existing code was a bit relaxed:
* On initially creating or re-configuring a CF, we might install an empty mapping, but soon thereafter (after releasing and re-acquiring the DB mutex) re-install another SuperVersion with a useful mapping.

The solution here is to refactor the logic so that there's a distinct but related workflow for (a) ensuring a quality set of mappings when we might only be considering a single CF (`EnsureSeqnoToTimeMapping()`), and (b) massaging that set of mappings to account for all CFs (`RegisterRecordSeqnoTimeWorker`) which doesn't need to re-install new SuperVersions because each CF already has good mappings and will get updated SuperVersions when the periodic task adds new mappings. This should eliminate the extra SuperVersion installs associated with preserve/preclude on CF creation or re-configure, making it the same as any other CF.

Some more details:
* Some refactorings such as removing new_seqno_to_time_mapping from SuperVersionContext. (Now use parameter instead of being stateful.)
* Propagate `read_only` aspect of DB to more places so that we can pro-actively disable preserve/preclude on read-only DBs, so that we don't run afoul of the assertion expecting SeqnoToTime entries.
* Introduce a utility struct `MinAndMaxPreserveSeconds` for aggregating preserve/preclude settings in a useful way, sometimes on one CF and sometimes across multiple CFs. Much cleaner! (IMHO)
* Introduce a function `InstallSuperVersionForConfigChange` that is a superset of `InstallSuperVersionAndScheduleWork` for when a CF is new or might have had a change to its mutable options.
* Eliminate redundant re-install SuperVersions of created "missing" CFs in DBImpl::Open.

Intended follow-up:
* Ensure each flush has an "upper bound" SeqnoToTime entry, which would resolve a FIXME in tiered_compaction_test, but causes enough test churn to deserve its own PR + investigation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13316

Test Plan:
This change is primarily validated by a new assertion in SuperVersion::Init to ensure consistency between (a) presence of any SeqnoToTime mappings in the SuperVersion and (b) preserve/preclude option being currently set.

One unit test update was needed because we now ensure at least one SeqnoToTime entry is created on any DB::Open with preserve/preclude, so that there is a lower bound time on all the future data writes. This required a small hack in associating the time with Seqno 1 instead of 0, which is reserved for "unspecified old."

Reviewed By: cbi42

Differential Revision: D70540638

Pulled By: pdillinger

fbshipit-source-id: bb419fdbeb5a1f115fc429c211f9b8efaf2f56d7
2025-03-04 17:44:01 -08:00
Nicolas De Carli
5f9b7ccce3 Add ROCKSDB_AUXV_GETAUXVAL_PRESENT flag to defs.bzl (#13435)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13435

We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.

The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.

This fix enables the flag internally and reverts the previous fix, landed with D70423483.

Reviewed By: pdillinger

Differential Revision: D70584250

fbshipit-source-id: 28e41316187c474fdfaf854f301ad14b6721fcad
2025-03-04 16:51:19 -08:00
Sean Ovens
0c7e5bd2f0 Shrink size of HashSkipList buckets from 56B to 48B (#13424)
Summary:
Previous order of fields in SkipList:

```
const uint16_t kMaxHeight_;  // 2B
const uint16_t kBranching_;  // 2B
const uint32_t kScaledInverseBranching_;  // 4B
Comparator const compare_;  // 8B
Allocator* const allocator_;  // 8B
Node* const head_;  // 8B
std::atomic<int> max_height_;  // 4B
// 4B padding added automatically for alignment
Node** prev_;  // 8B
int32_t prev_height_;  // 4B
// 4B padding added automatically for alignment
```

= 56B in total. By swapping prev_ and prev_height_, we get the following:

```
const uint16_t kMaxHeight_;  // 2B
const uint16_t kBranching_;  // 2B
const uint32_t kScaledInverseBranching_;  // 4B
Comparator const compare_;  // 8B
Allocator* const allocator_;  // 8B
Node* const head_;  // 8B
std::atomic<int> max_height_;  // 4B
int32_t prev_height_;  // 4B
Node** prev_;  // 8B
```

= 48B in total. So this change saves 8B per SkipList object. When allocated using AllocateAligned (as is the case for the [hash skiplist](https://github.com/facebook/rocksdb/blob/main/memtable/hash_skiplist_rep.cc#L243)) and assuming alignof(std::max_align_t) = 16, this change saves an additional 8B per SkipList object (so 16B in total).

Note: this does not affect the "skiplist" memtable, which internally uses InlineSkipList

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13424

Reviewed By: cbi42

Differential Revision: D70423252

Pulled By: pdillinger

fbshipit-source-id: 450dcc7f0e9e86cd3481f6930e83eea5fef78b97
2025-03-03 21:25:29 -08:00
Changyu Bi
7e272d2032 Update MultiGet to provide consistent CF view for kPersistedTier (#13433)
Summary:
When reading with ReadOptions::read_tier = kPersistedTier and with a snapshot, MultiGet allows the case where some CFs are read before a flush and some CFs are read after the flush. This is not desirable, especially when atomic_flush is enabled and users use MultiGet to do consistency checks on the data in SST files. This PR updates the code path for SuperVersion acquisition to get a consistent view across CFs when kPersistedTier is used.
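A sketch of the consistency-check pattern this protects (reading one key from several CFs at a shared snapshot, restricted to persisted data):

```
#include <string>
#include <vector>

#include <rocksdb/db.h>

void CheckPersistedConsistency(
    rocksdb::DB* db, const std::vector<rocksdb::ColumnFamilyHandle*>& cfs) {
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  rocksdb::ReadOptions ro;
  ro.snapshot = snap;
  ro.read_tier = rocksdb::kPersistedTier;  // only data persisted in SST files

  std::vector<rocksdb::Slice> keys(cfs.size(), rocksdb::Slice("key"));
  std::vector<std::string> values(cfs.size());
  // With this fix, the SuperVersions for all CFs are acquired at a
  // consistent point, so a concurrent flush cannot make some CFs reflect
  // pre-flush state and others post-flush state.
  std::vector<rocksdb::Status> statuses = db->MultiGet(ro, cfs, keys, &values);

  db->ReleaseSnapshot(snap);
  (void)statuses;  // compare values/statuses across CFs here
}
```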

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13433

Test Plan: a new unit test that could be flaky without this change.

Reviewed By: jaykorean

Differential Revision: D70509688

Pulled By: cbi42

fbshipit-source-id: 80de96f94407af9bb2062b6a185c61f65827c092
2025-03-03 15:21:10 -08:00
Nicolas De Carli
1d6c33d2a5 Enable hardware accelerated crc32c for ARM on Linux (#13432)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13432

We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.

The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.

This fix allows the usage of hardware-accelerated CRC32c within our fleet.

Reviewed By: jaykorean

Differential Revision: D70423483

fbshipit-source-id: 601da3fbf156e3e40695eb76ee5d37f67f83d427
2025-03-02 08:05:21 -08:00
Peter Dillinger
ebaeb03648 Write failure can be permanently fatal and break WriteBatch atomicity (#13428)
Summary:
This adds a test that attempts DeleteRange() with PlainTable (not supported) and shows that it not only puts the DB in failed-write mode, but also (a) breaks WriteBatch atomicity for readers, because they can see just part of a failed WriteBatch, and (b) makes the DB not recoverable (without manual intervention) if using WAL.

Note: WriteBatch atomicity is not clearly documented but indicated at the top of write_batch.h and the wiki page for Transactions, even without Transactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13428

Test Plan: this is the test

Reviewed By: anand1976

Differential Revision: D70332226

Pulled By: pdillinger

fbshipit-source-id: 67bc4de68833a80578e48baa9d3a4f23f1600f3c
2025-02-27 11:37:56 -08:00
Jay Huh
d1f383b8eb Add Logging for debugging InputFileCheck Failure (#13427)
Summary:
Add detailed log for debugging purpose

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13427

Test Plan: CI

Reviewed By: cbi42, hx235

Differential Revision: D70274613

Pulled By: jaykorean

fbshipit-source-id: de4bc61853136b923aa786717e7979be8886b9bd
2025-02-26 15:31:47 -08:00
Peter Dillinger
3af905aa68 Format compatibility test cover compressions, including mixed (#13414)
Summary:
The existing format compatibility test had limited coverage of compression options, particularly newer algorithms with and without dictionary compression. There are some subtleties that need to remain consistent, such as index blocks potentially being compressed but *not* using the file's dictionary if they are. This involves detecting (with a rough approximation) builds with the appropriate capabilities.

The other motivation for this change is testing some potentially useful reader-side functionality that has been in place for a long time but has not been exercised until now: mixing compressions in a single SST file. The block-based SST schema puts a compression marker on each block; arguably this is for distinguishing blocks compressed using the algorithm stored in compression_name table property from blocks left uncompressed, e.g. because they did not reach the threshold of useful compression ratio, but the marker can also distinguish compression algorithms / decompressors.

As we work toward customizable compression, it seems worth unlocking the capability to leverage the existing schema and SST reader-side support for mixing compression algorithms among the blocks of a file. Yes, a custom compression could implement its own dynamic algorithm chooser with its own tag on the compressed data (e.g. first byte), but that is slightly less storage efficient and doesn't support "vanilla" RocksDB builds reading files using a mix of built-in algorithms. As a hypothetical example, we might want to switch to lz4 on a machine that is under heavy CPU load and back to zstd when load is more normal. I dug up some data indicating ~30 seconds per output file in compaction, suggesting that file-level responsiveness might be too slow. This agility is perhaps more useful with disaggregated storage, where there is more flexibility in DB storage footprint and potentially more payoff in optimizing the *average* footprint.

In support of this direction, I have added a backdoor capability for debug builds of `ldb` to generate files with a mix of compression algorithms and incorporated this into the format compatibility test. All of the existing "forward compatible" versions (currently back to 8.6) are able to read the files generated with "mixed" compression. (NOTE: there's no easy way to patch a bunch of old versions to have them support generating mixed compression files, but going forward we can auto-detect builds with this "mixed" capability.) A subtle aspect of this support is that for proper handling of decompression contexts and digested dictionaries, we need to set the `compression_name` table property to `zstd` if any blocks are zstd compressed. I'm expecting to add better info to SST files in follow-up, but this approach here gives us forward compatibility back to 8.6.

However, in the spirit of opening things up with what makes sense under the existing schema, we only support one compression dictionary per file. It will be used by any/all algorithms that support dictionary compression. This is not outrageous because it seems standard that a dictionary is *or can be* arbitrary data representative of what will be compressed. This means we would need a schema change to add dictionary compression support to an existing built-in compression algorithm (because otherwise old versions and new versions would disagree on whether the data dictionary is needed with that algorithm; this could take the form of a new built-in compression type, e.g. `kSnappyCompressionWithDict`; only snappy, bzip2, and windows-only xpress compression lack dictionary support currently).

Looking ahead to supporting custom compression, exposing a sizeable set of CompressionTypes to the user for custom handling essentially guarantees a path for the user to put *versioning* on their compression even if they neglect that initially, and without resorting to managing a bunch of distinct named entities. (I'm envisioning perhaps 64 or 127 CompressionTypes open to customization, enough for ~weekly new releases with more than a year of horizon on recycling.)

More details:
* Reduce the running time (CI cost) of the default format compatibility test by randomly sampling versions that aren't the oldest in a category. AFAIK, pretty much all regressions can be caught with the even more stripped-down SHORT_TEST.
* Configurable make parallelism with J environment variable
* Generate data files in a way that makes them much more eligible for index compression, e.g. bigger keys with less entropy
* Generate enough data files
* Remove 2.7.fb.branch from list because it shows an assertion violation when involving compression.
* Randomly choose a contiguous subset of the compression algorithms X {dictionary, no dictionary} configuration space when generating files, with a number of files > number of algorithms. This covers all the algorithms and both dictionary/no dictionary for each release (but not in all combinations).
* Have `ldb` fail if the specified compression type is not supported by the build.

Other future work needed:
* Blob files in format compatibility test, and support for mixed compression. NOTE: the blob file schema should naturally support mixing compression algorithms but the reader code does not because of an assertion that the block CompressionType (if not no compression) matches the whole file CompressionType. We might introduce a "various" CompressionType for this whole file marker in blob files.
* Do more to ensure certain features and code paths e.g. in the scripts are actually used in the compatibility test, so that they aren't accidentally neutralized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13414

Test Plan: Manual runs with some temporary instrumentation, also a recent revision of this change included a GitHub Actions run of the updated format compatible test: https://github.com/facebook/rocksdb/actions/runs/13463551149/job/37624205915?pr=13414

Reviewed By: hx235

Differential Revision: D70012056

Pulled By: pdillinger

fbshipit-source-id: 9ea5db76ba01a95338ed1a86b0edd71a469c4061
2025-02-25 00:12:34 -08:00
Changyu Bi
3740fccc4b Update for next release 10.0 (#13417)
Summary:
Updated version, HISTORY, compatibility script and folly hash for 10.0 release.

Included a HISTORY.md update backported from 10.0 branch: c1f63e16f0def68ad83209700d486889ae86d8ce.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13417

Test Plan: CI

Reviewed By: hx235

Differential Revision: D70029393

Pulled By: cbi42

fbshipit-source-id: f8276bb31cc69648b47e0cbcd728d2a33fbf531f
2025-02-24 13:28:20 -08:00
Changyu Bi
4c975a7c22 Disable flaky unit test RoundRobinSubcompactionsAgainstPressureToken (#13416)
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/13417174378/job/37480755623?fbclid=IwZXh0bgNhZW0CMTEAAR2pj4E1ua6zMxz4FxnPAPLIz011t1ddjaWPbmFlldfSG7dZGjWGVy-mDkg_aem_40kU2iCmcN93WsmzLZxGsA) and my previous [fix](https://github.com/facebook/rocksdb/pull/13347) did not seem to work. It's likely a test setup issue; disabling the test for now to reduce test failure noise, since the RoundRobin compaction style is not widely used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13416

Test Plan: CI

Reviewed By: hx235

Differential Revision: D70002097

Pulled By: cbi42

fbshipit-source-id: afe0f56363501dab2c9dc297bfbe0dff0ac6aeb3
2025-02-21 12:55:29 -08:00
Maciej Szeszko
5139ff5c29 Conditional check reordering (#13415)
Summary:
This change is addressing a valid concern raised in https://github.com/facebook/rocksdb/pull/13408#discussion_r1966000661.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13415

Test Plan: Existing test collateral.

Reviewed By: cbi42

Differential Revision: D69999071

Pulled By: mszeszko-meta

fbshipit-source-id: 5ebb195b2b83701e06c33bfcb19c57d9ac1c1dc6
2025-02-21 12:42:14 -08:00
Changyu Bi
d7aea6955c Fix stress test DB verification methods (#13409)
Summary:
Update VerifyDB() to respect user-specified flags when choosing the verification method.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13409

Test Plan: existing CI.

Reviewed By: hx235

Differential Revision: D69885644

Pulled By: cbi42

fbshipit-source-id: bbaa931cece3525f00d775639ec7b63ff0101d94
2025-02-21 10:50:01 -08:00
Changyu Bi
3c2c2689b9 Merge support in WBWIMemTable (#13410)
Summary:
Added merge support for WBWIMemTable. Most of the preparation work was done in https://github.com/facebook/rocksdb/issues/13387 and https://github.com/facebook/rocksdb/issues/13400. The main code change to support merge is in wbwi_memtable.cc, to support reading the Merge value type. The rest of the changes are mostly comment changes and tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13410

Test Plan:
- new unit test
- ran `python3 ./tools/db_crashtest.py --txn blackbox  --txn_write_policy=0 --commit_bypass_memtable_one_in=100 --test_batches_snapshots=0 --use_merge=1` for several runs.

Reviewed By: jowlyzhang

Differential Revision: D69885868

Pulled By: cbi42

fbshipit-source-id: b127d95a3027dc35910f6e5d65f3409ba27e2b6b
2025-02-20 20:21:45 -08:00
Maciej Szeszko
1e1c199316 Fix dbstress run - attempt 1 (#13408)
Summary:
This PR attempts to fix the **dbstress failures** post https://github.com/facebook/rocksdb/pull/13354. There are at least 2 high-level categories of errors: 1) likely caused by wide-scope snapshot initialization ([issue found by Peter](https://github.com/facebook/rocksdb/pull/13354#discussion_r1960177118)), 2) lack of proper error propagation. Wrt 2), part of the problem is a real miss (we should condition auto refresh on `status().ok()` after calling `Next` / `Prev`), but another part - [failure in propagating the dbstress-injected read error](https://github.com/facebook/rocksdb/pull/13354#discussion_r1960913871) in file deletion - is expected and should not be asserted on in dbstress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13408

Test Plan:
Confirmed there are no more errors after running sandcastle crashtest for each of the failing flavors:

```hcl
https://www.internalfb.com/sandcastle/workflow/252201579138859344
https://www.internalfb.com/sandcastle/workflow/3233584532458171962
https://www.internalfb.com/sandcastle/workflow/1283525893806766134
https://www.internalfb.com/sandcastle/workflow/2796735368603351293
https://www.internalfb.com/sandcastle/workflow/3792030886252148966
https://www.internalfb.com/sandcastle/workflow/67553994428973733
https://www.internalfb.com/sandcastle/workflow/3886606478427208295
https://www.internalfb.com/sandcastle/workflow/1684346260642682928
https://www.internalfb.com/sandcastle/workflow/4197354852715406516
https://www.internalfb.com/sandcastle/workflow/535928355663233170
https://www.internalfb.com/sandcastle/workflow/3409224917925569737
```

Reviewed By: cbi42

Differential Revision: D69869766

Pulled By: mszeszko-meta

fbshipit-source-id: 7a5b121218fb1dc0a37887d6fe2a5c07e2b894cf
2025-02-20 11:07:01 -08:00
Peter Dillinger
836e88ab7a Add test for memtable bloom filter with WriteBufferManager (#13398)
Summary:
... to ensure proper cache charging. However, this is a somewhat hazardous combination if there are many CFs and could be the target of future work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13398

Test Plan: this is the test

Reviewed By: hx235

Differential Revision: D69619977

Pulled By: pdillinger

fbshipit-source-id: 9841768584e4688d8fdd0258f3ba9608b67408e5
2025-02-20 10:16:12 -08:00
Sarang Masti
129b7791f9 Bugfix: Ensure statuses are initialized with OK() in SSTFileReader::MultiGet (#13411)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13411

We should initialize statuses with OK rather than IOError to correctly handle cases like NotFound due to the bloom filter. In case of an IOError, the status would be updated appropriately by the reader.
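A simplified standalone illustration of the pattern (not the actual SstFileReader internals): output statuses must start as OK, because the reader only overwrites entries it actually touches.

```
#include <cstddef>
#include <vector>

#include <rocksdb/status.h>

void MultiGetLike(size_t num_keys, std::vector<rocksdb::Status>* statuses) {
  // Before the fix these defaulted to IOError, so a key skipped by the
  // bloom filter (never read at all) reported a spurious IOError.
  statuses->assign(num_keys, rocksdb::Status::OK());
  // ... the reader then overwrites individual entries only on real errors
  // or NotFound outcomes.
}
```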

Reviewed By: anand1976

Differential Revision: D69886976

fbshipit-source-id: 92b130168f23633224ff4153bfe46a7d86482b90
2025-02-19 19:38:53 -08:00
Changyu Bi
36838bbf51 Update sequence number assignment method in WBWIMemTable (#13400)
Summary:
This is a preparation for supporting merge in `WBWIMemTable`. This PR updates the sequence number assignment method so that it allows efficient and simple assignment when there are multiple entries with the same user key. This can happen when the WBWI contains Merge operations. This assignment relies on tracking the number of updates issued for each key in each WBWI entry (`WriteBatchIndexEntry::update_count`). Some refactoring is done in WBWI to remove `last_entry_offset` from the WBWI state, which I found harder to use correctly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13400

Test Plan: updated unit tests to check that update count is tracked correctly and WBWIMemTable is assigning sequence number as expected.

Reviewed By: pdillinger

Differential Revision: D69666462

Pulled By: cbi42

fbshipit-source-id: 9b18291825017a67c4da3318e8a556aa2971326b
2025-02-19 12:35:56 -08:00
anand76
920d25e34e Initial version of an external table reader interface (#13401)
Summary:
This PR introduces an interface to plug in an external table file reader into RocksDB. The external table reader may support custom file formats that might work better for a specific use case compared to RocksDB native formats. This initial version allows the external table file to be loaded and queried using an `SstFileReader`. In the near future, we will allow it to be used with a limited RocksDB instance that allows bulkload but not live writes.

The model of a DB using an external table reader is a read-only database allowing bulkload and atomic replace in the bottommost level only. Live writes, if supported in the future, are expected to use block-based table files in higher levels. Tombstones, merge operands, and non-zero sequence numbers are expected to be present only in non-bottommost levels. External table files are assumed to have only Puts, and all keys implicitly have sequence number 0.

TODO (in future PRs) -
1. Add support for external file ingestion, with safety mechanisms to prevent accidental writes
2. Add support for atomic column family replace
3. Allow custom table file extensions
4. Add a TableBuilder interface for use with `SstFileWriter`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13401

Reviewed By: pdillinger

Differential Revision: D69689351

Pulled By: anand1976

fbshipit-source-id: c5d5b92d56fd4d0fc43a77c4ceb0463d4f479bda
2025-02-19 09:52:08 -08:00
Sarang Masti
a0edca32cf MultiGet support in SstReader (#13403)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13403

Add MultiGet support in SstFileReader. Today we only have iteration support; this change adds MultiGet support to SstFileReader in case some application wants to use it.

Reviewed By: anand1976

Differential Revision: D69514499

fbshipit-source-id: 20e85a4bd13a3a9f45dacb223c1a4541fb87f561
2025-02-19 08:09:36 -08:00
Hui Xiao
9be5e15a26 Disable auto_refresh_iterator_with_snapshot temporarily in stress test (#13402)
Summary:
Context/Summary: recent crash test failures seem to have found issues with the recently added auto_refresh_iterator_with_snapshot combined with prefix/scan and injected read errors. For now, let's disable auto_refresh_iterator_with_snapshot.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13402

Test Plan: monitor CI

Reviewed By: mszeszko-meta

Differential Revision: D69677731

Pulled By: hx235

fbshipit-source-id: eea6630e96d53fba8dbf3877a49819690dfab2f6
2025-02-18 11:51:27 -08:00
Changyu Bi
4b5f0a4fcc Fix GetMergeOperands() in ReadOnly and SecondaryDB (#13396)
Summary:
While working on memtable code, I noticed that the `do_merge` parameter is not properly set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13396

Test Plan: updated unit test for the read-only db case.

Reviewed By: jaykorean

Differential Revision: D69505015

Pulled By: cbi42

fbshipit-source-id: d4c64ca7bba31fe26aa41a29cbc55835d9f1f116
2025-02-18 11:01:19 -08:00
Hui Xiao
7069691f7e Disable track_and_verify_wals completely temporarily (#13405)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/13263 and https://github.com/facebook/rocksdb/pull/13360 temporarily disabled `track_and_verify_wals` with some injection under TXN, but recent stress tests have found more issues that this feature surfaced, even with the previous disabling. Disabling the feature **completely** now to stabilize CI while debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13405

Test Plan: Monitor CI

Reviewed By: cbi42

Differential Revision: D69759276

Pulled By: hx235

fbshipit-source-id: 501a3561acb9daa834f874095f9a66ae6ae5aa42
2025-02-18 09:39:00 -08:00
Hui Xiao
6aacec07dc Call Clean() on JobContext before destruction in UT (#13406)
Summary:
**Context/Summary:**
It's documented (affcad0cc9/db/job_context.h (L230)) that `// For non-empty JobContext Clean() has to be called at least once before destruction`. This was accidentally violated in a UT, causing the assertion failure `assert(logs_to_free.size() == 0);` in `~JobContext`. This PR fixes it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13406

Test Plan: Monitor for future UT assertion failure in `TEST_F(DBWALTest, FullPurgePreservesRecycledLog) `

Reviewed By: cbi42

Differential Revision: D69759725

Pulled By: hx235

fbshipit-source-id: dd1617b370a2c69daba657287dcf258542f92ef5
2025-02-18 09:37:03 -08:00
Hui Xiao
affcad0cc9 Fix corrupted wal number when predecessor wal corrupts + minor cleanup (#13359)
Summary:
**Context/Summary:**

02b4197544 recently added the ability to detect a WAL hole present in the predecessor WAL. It forgot to update the corrupted WAL number to point to the predecessor WAL in that corruption case. This PR fixes that.

As a bonus, this PR also (1) fixed the `FragmentBufferedReader()` constructor API to expose fewer parameters, as some were never explicitly passed in the codebase, (2) an INFO log wording, (3) a parameter naming typo, and (4) the reporter naming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13359

Test Plan:
1. Manual printing to ensure the corrupted wal number is set to the right number
2. Existing UTs

Reviewed By: jowlyzhang

Differential Revision: D69068089

Pulled By: hx235

fbshipit-source-id: f7f8a887cded2d3a26cf9982f5d1d1ab6a78e9e1
2025-02-13 21:49:51 -08:00
Hui Xiao
f6b2cdd350 Disable secondary test with sst truncation deletion; API clarification (#13395)
Summary:
**Context/Summary:**
Secondary DB relies on an open file descriptor of the shared SST file in the primary DB to continue being able to read the file even if that file is deleted in the primary DB. However, this won't work if the file is truncated instead of deleted, which triggers a "truncated block read" corruption in the stress test on secondary DB reads. Truncation can happen if the RocksDB implementation of SstFileManager and `bytes_max_delete_chunk > 0` are used. This PR disables such a testing combination in the stress test and clarifies the related API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13395

Test Plan:
- Manually repro-ed with the UT below. I'm in favor of not including this UT in the codebase as the incompatibility should now be self-evident from the API comment. Secondary DB is on a path to being replaced by Follower, so we should minimize edge-case tests with no functional change for a to-be-replaced functionality.
```
TEST_F(DBSecondaryTest, IncompatibleWithPrimarySSTTruncation) {
  Options options;
  options.env = env_;
  options.disable_auto_compactions = true;
  options.sst_file_manager.reset(NewSstFileManager(
      env_, nullptr /*fs*/, "" /*trash_dir*/, 2024000 /*rate_bytes_per_sec*/,
      true /*delete_existing_trash*/, nullptr /*status*/,
      0.25 /*max_trash_db_ratio*/, 1129 /*bytes_max_delete_chunk*/));
  Reopen(options);

  ASSERT_OK(Put("key1", "old_value"));
  ASSERT_OK(Put("key2", "old_value"));
  ASSERT_OK(Flush());
  ASSERT_OK(Put("key1", "new_value"));
  ASSERT_OK(Put("key3", "new_value"));
  ASSERT_OK(Flush());

  Options options1;
  options1.env = env_;
  options1.max_open_files = -1;
  Reopen(options);
  OpenSecondary(options1);
  ASSERT_OK(db_secondary_->TryCatchUpWithPrimary());

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "DeleteScheduler::DeleteTrashFile:Fsync", [&](void*) {
        std::string value;
        Status s = db_secondary_->Get(ReadOptions(), "key2", &value);
        assert(s.IsCorruption());
        assert(s.ToString().find("truncated block read") !=
            std::string::npos);
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
```
- Monitor future stress test

Reviewed By: jowlyzhang

Differential Revision: D69499694

Pulled By: hx235

fbshipit-source-id: 57525b9841897f42aecb758a4d3dd3589367dcd9
2025-02-12 11:00:36 -08:00
Maciej Szeszko
8234d67e5a Auto refresh iterator with snapshot (#13354)
Summary:
# Problem
Once opened, an iterator will preserve its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh`'ed, the iterator will hold on to the `Init`-time assigned `SuperVersion` throughout its lifetime. As time goes by, this might result in an artificially long holdup of obsolete memtables (_potentially_ referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved to be especially problematic in the case of _logical_ backups (outside of RocksDB `BackupEngine`).

# Solution
Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in https://github.com/facebook/rocksdb/pull/10594, we're adding a new `ReadOptions` opt-in knob that (when enabled) will instruct the iterator to automatically refresh itself to the latest superversion - all that while retaining the originally assigned, explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To ensure minimal performance overhead, we're leveraging a relaxed atomic for superversion freshness lookups.
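A usage sketch (the `ReadOptions` field name `auto_refresh_iterator_with_snapshot` is taken from the db_bench/db_stress flag used elsewhere in this log; an explicit snapshot is required for the feature to take effect):

```
#include <memory>

#include <rocksdb/db.h>

void LogicalBackupScan(rocksdb::DB* db) {
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  rocksdb::ReadOptions ro;
  ro.snapshot = snap;  // reads stay consistent at this snapshot
  ro.auto_refresh_iterator_with_snapshot = true;  // opt-in knob from this PR

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Long-running scan: the iterator periodically swaps itself onto the
    // latest SuperVersion, releasing obsolete memtables it would otherwise
    // pin for its entire lifetime.
  }
  db->ReleaseSnapshot(snap);
}
```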

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13354

Test Plan:
**Correctness:** New test to demonstrate the auto refresh behavior in contrast to legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.

**Stress testing:** We're adding command line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can with random feature on/off configuration in db_crashtest.py.

# Benchmarking

The goal of this benchmark is to validate that throughput did not regress substantially. Benchmark was run on optimized build, 3-5 times for each respective category or till convergence. In addition, we configured aggressive threshold of 1 second for new `Superversion` creation. Experiments have been run 'in parallel' (at the same time) on separate db instances within a single host to evenly spread the potential adverse impact of noisy neighbor activities. Host specs [1].

**TLDR;** Baseline & new solution are practically indistinguishable from performance standpoint. Difference (positive or negative) in throughput relative to the baseline, if any, is no more than 1-2%.

**Snapshot initialization approach:**

This feature is only effective on iterators with a well-defined `snapshot` passed via the `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive, especially in the case of 'tiny scans' (100 rows), contributing as much as 25-35 microseconds, which is ~20-30% of the average per-op latency, unintentionally masking the _potentially_ adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all the future scans _before_ running multiple experiments en masse. This is also a valuable data point for us to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower-bound cost could be.

## "DB in memory" benchmark

**DB Setup**

1. Allow a single memtable to grow large enough (~572MB) to fit in all the rows. Upon shutdown all the rows will be flushed to the WAL file (inspected `000004.log` file is 541MB in size).

```
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000  max_write_buffer_number=2 -compression_type=none
```

2. As a part of recovery in subsequent DB open, WAL will be processed to one or more SST files during the recovery. We're selecting a large block cache (`cache_size` parameter in `db_bench` script) suitable for holding the entire DB to test the “hot path” CPU overhead.

```
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=1 -async_io=1 -num=? -seek_nexts=? -cache_size=? -write_buffer_size=? -auto_refresh_iterator_with_snapshot={0|1}
```

  | seek_nexts=100; num=2,000,000 | seek_nexts = 20,000; num=50000  | seek_nexts = 400,000; num=2000
-- | -- | -- | --
baseline | 36362 (± 300) ops/sec, 928.8 (± 23) MB/s, 99.11% block cache hit  | 52.5 (± 0.5) ops/sec, 1402.05 (± 11.85) MB/s, 99.99% block cache hit | 156.2 (± 6.3) ms / op, 1330.45 (± 54) MB/s, 99.95% block cache hit
auto refresh |  35775.5 (± 537) ops/sec, 926.65 (± 13.75) MB/s, 99.11% block cache hit |  53.5 (± 0.5) ops/sec, 1367.9 (± 9.5) MB/s, 99.99% block cache hit |  162 (± 4.14) ms / op, 1281.35 (± 32.75) MB/s, 99.95% block cache hit

_-cache_size=5000000000 -write_buffer_size=3200000000 -max_write_buffer_number=2_

  | seek_nexts=3,500,000; num=100
-- | --
baseline | 1447.5 (± 34.5) ms / op, 1255.1 (± 30) MB/s, 98.98% block cache hit
auto refresh | 1473.5 (± 26.5) ms / op, 1232.6 (± 22.2) MB/s, 98.98% block cache hit

_-cache_size=17680000000 -write_buffer_size=14500000000 -max_write_buffer_number=2_

  | seek_nexts=17,500,000; num=10
-- | --
baseline | 9.11 (± 0.185) s/op, 997 (± 20) MB/s
auto refresh | 9.22 (± 0.1) s/op, 984 (± 11.4) MB/s

[1]

### Specs

  | Property | Value
-- | --
RocksDB | version 10.0.0
Date | Mon Feb  3 23:21:03 2025
CPU | 32 * Intel Xeon Processor (Skylake)
CPUCache | 16384 KB
Keys | 16 bytes each (+ 0 bytes user-defined timestamp)
Values | 100 bytes each (50 bytes after compression)
Prefix | 0 bytes
RawSize | 5.5 MB (estimated)
FileSize | 3.1 MB (estimated)
Compression | Snappy
Compression sampling rate | 0
Memtablerep | SkipListFactory
Perf Level | 1

Reviewed By: pdillinger

Differential Revision: D69122091

Pulled By: mszeszko-meta

fbshipit-source-id: 147ef7c4fe9507b6fb77f6de03415bf3bec337a8
2025-02-11 19:36:47 -08:00
Jay Huh
a30c0204cc Set Options File Number for CompactionInput under Mutex Lock (#13394)
Summary:
The options file number to be read by the remote worker is part of the `CompactionServiceInput`. We've been setting this in `ProcessKeyValueCompactionWithCompactionService()` while the db_mutex is not held, but it needs to be accessed while the mutex is held. The value can change as part of `SetOptions() -> RenameTempFileToOptionsFile()`, as in the following.

e6972196bc/db/db_impl/db_impl.cc (L5595-L5596)

Keep this value in memory during `CompactionJob::Prepare()`, which is called while the mutex is held, so that we can easily access it later without the mutex when building the CompactionInput for the remote compaction.

Thanks to the crash test, this was surfaced after https://github.com/facebook/rocksdb/issues/13378 was merged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13394

Test Plan:
Unit Test
```
./compaction_service_test
```

Crash Test
```
COERCE_CONTEXT_SWITCH=1 COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j100 dbg
```
```
python3 -u tools/db_crashtest.py blackbox --enable_remote_compaction=1
```

Reviewed By: jowlyzhang

Differential Revision: D69496313

Pulled By: jaykorean

fbshipit-source-id: 7e38e3cb75d5a7708beb4883e1a138e2b09ff837
2025-02-11 19:11:36 -08:00
aletar89
e6972196bc Add python binding to LANGUAGE-BINDINGS.md (#13391)
Summary:
The only actively maintained Python binding is RocksDict.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13391

Reviewed By: jowlyzhang

Differential Revision: D69431688

Pulled By: cbi42

fbshipit-source-id: 5e564118769a86cb8834a4faa5d852a54bbfea64
2025-02-10 17:38:40 -08:00
Changyu Bi
3f460ad0d2 Reverse the order of updates to the same key in WriteBatchWithIndex (#13387)
Summary:
As a preparation to support merge in [WBWIMemtable](d48af21386/memtable/wbwi_memtable.h (L31)), this PR updates how we [order updates to the same key](d48af21386/utilities/write_batch_with_index/write_batch_with_index_internal.cc (L694-L697)) in WriteBatchWithIndex. Specifically, the order is now reversed such that the more recent update is ordered first. This will make iterating from WriteBatchWithIndex much easier since the key ordering in WBWI now matches internal key order, where keys with larger sequence numbers are ordered first. The ordering is now explicitly documented above the declaration of the `WriteBatchWithIndex` class.

Places that use `WBWIIteratorImpl` and assume key ordering are updated. The rest is test and comments update.
This will affect users who use WBWIIterator directly; the outputs of GetFromBatch, GetFromBatchAndDB and NewIteratorWithBase are not affected. Users are only affected if they may issue multiple updates to the same key. If WriteBatchWithIndex is created with `overwrite_key=true`, one of the updates needs to be a Merge (see the sketch below).
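A sketch of the new ordering when scanning the index directly (with `overwrite_key=false` so both updates are kept):

```
#include <memory>

#include <rocksdb/comparator.h>
#include <rocksdb/utilities/write_batch_with_index.h>

void ShowNewOrdering() {
  rocksdb::WriteBatchWithIndex wbwi(rocksdb::BytewiseComparator(),
                                    /*reserved_bytes=*/0,
                                    /*overwrite_key=*/false);
  wbwi.Put("k", "v1");    // older update
  wbwi.Merge("k", "v2");  // newer update

  std::unique_ptr<rocksdb::WBWIIterator> it(wbwi.NewIterator());
  it->SeekToFirst();
  // After this PR, the Merge ("v2") is seen first and the Put ("v1")
  // second: more recent updates to the same key now order first, matching
  // internal key ordering by sequence number.
}
```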

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13387

Test Plan: we have some good coverage of WBWI, I updated some existing tests and added a test for `WBWIIteratorImpl`.

Reviewed By: pdillinger

Differential Revision: D69421268

Pulled By: cbi42

fbshipit-source-id: d97eec4ee74aeac3937c9758041c7713f07f9676
2025-02-10 17:15:47 -08:00
Peter Dillinger
c9ce4a3d6b Improve atomicity of SetOptions, skip manifest write (#13384)
Summary:
Motivated by code review issue in https://github.com/facebook/rocksdb/issues/13316, we don't want to release the DB mutex in SetOptions between updating the cfd latest options and installing the new Version and SuperVersion. SetOptions uses LogAndApply to install a new Version but this currently incurs an unnecessary manifest write. (This is not a big performance concern because SetOptions dumps a new OPTIONS file, which is much larger than the redundant manifest update.) Since we don't want IO while holding the DB mutex, we need to get rid of the manifest write, and that's what this change does. We introduce a kind of dummy VersionEdit that allows the existing code paths of LogAndApply to install a new Version (with the updated mutable options), recompute resulting compaction scores etc., but without the manifest write.

Part of the validation for this is new assertions in SetOptions verifying the consistency of the various copies of MutableCFOptions. (I'm not convinced we need it in SuperVersion in addition to Version, but that's not for here and now.) These checks depend on defaulted `operator==` and so depend on C++20.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13384

Test Plan:
New unit test in addition to new assertions. SetOptions already tested heavily in crash test. Used
`ROCKSDB_CXX_STANDARD=c++20 make -j100 check` to ensure the new assertions are verified

Reviewed By: cbi42

Differential Revision: D69408829

Pulled By: pdillinger

fbshipit-source-id: 4cf026010c6bb381e0ea27567cce2708d4678e7d
2025-02-10 16:46:13 -08:00
Hui Xiao
1f36399a77 Blog post about the mitigated misconfig bug (#13386)
Summary:
**Context/Summary:** as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13386

Test Plan: Run the webpage locally according to https://github.com/facebook/rocksdb/tree/main/docs and check everything is fine

Reviewed By: jaykorean

Differential Revision: D69334257

Pulled By: hx235

fbshipit-source-id: 4e9b6dbddd5035b9277044c6cb0ac97b3819ec6c
2025-02-10 13:10:33 -08:00