Summary:
## Background
Compaction statistics are collected at various levels across different classes and structs.
* `InternalStats::CompactionStats`: Per-level Compaction Stats within a job (can be at the subcompaction level and later aggregated to the compaction level)
* `InternalStats::CompactionStatsFull`: Contains two per-level compaction stats - `output_level_stats` for primary output level stats and `proximal_level_stats` for proximal level stats. Proximal level statistics are only relevant when using Tiered Storage with the per-key placement feature enabled.
* `InternalStats::CompactionOutputsStats`: Simplified version of `InternalStats::CompactionStats`. Only has a subset of fields from `InternalStats::CompactionStats`
* `CompactionJobStats`: Job-level Compaction Stats (can be at the subcompaction level and later aggregated to the compaction level)
Please note that some fields in Job-level stats are not in Per-level stats and they don't map 1-to-1 today.
## Issues
* In non-remote compactions, proximal level compaction statistics were not being aggregated into job-level statistics. Job-level statistics were missing the proximal level stats for tiered storage compactions with the per-key-placement feature enabled.
* During remote compactions, proximal level compaction statistics were pre-aggregated into job-level statistics on the remote side. However, per-level compaction statistics were not part of the serialized compaction result, so the primary host lost that information and wasn't able to populate `per_key_placement_comp_stats_` and `internal_stats_.proximal_level_stats` properly during installation.
* `TieredCompactionTest` was only checking whether (expected stats > 0 && actual stats > 0) instead of comparing actual values
## Fixes
* Renamed `compaction_stats_` to `internal_stats_` for `InternalStats::CompactionStatsFull` in `CompactionJob` for better readability
* Removed the usage of `InternalStats::CompactionOutputsStats` and consolidated them to `InternalStats::CompactionStats`.
* Remote Compactions now include the internal stats in the serialized `CompactionServiceResult`. `output_level_stats` and `proximal_level_stats` are later propagated into the sub_compact output stats accordingly.
* `CompactionJob::UpdateCompactionJobStats()` now takes `CompactionStatsFull` and aggregates the `proximal_level_stats` as well (see the simplified sketch after this list)
* `TieredCompactionTest` now does actual value comparisons for input/output file counts and record counts. A follow-up is needed to do the same for the bytes read / written.
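For illustration, here is a minimal, self-contained sketch of the aggregation fix. The struct and function names are simplified stand-ins, not the actual `InternalStats` definitions, and only a subset of fields is shown:
```
#include <cstdint>

// Simplified stand-in for a per-level stats bucket.
struct PerLevelStats {
  uint64_t num_output_records = 0;
  uint64_t num_output_files = 0;
};

// Simplified stand-in for the job-level stats.
struct JobLevelStats {
  uint64_t num_output_records = 0;
  uint64_t num_output_files = 0;
};

// Simplified stand-in for CompactionStatsFull: one bucket for the primary
// output level and one for the proximal level (per-key placement only).
struct StatsFull {
  PerLevelStats output_level_stats;
  PerLevelStats proximal_level_stats;
};

// Before the fix, only output_level_stats fed the job-level stats; the
// proximal-level output was silently dropped for tiered-storage compactions.
void UpdateJobStats(const StatsFull& internal_stats, JobLevelStats* job) {
  job->num_output_records +=
      internal_stats.output_level_stats.num_output_records +
      internal_stats.proximal_level_stats.num_output_records;
  job->num_output_files +=
      internal_stats.output_level_stats.num_output_files +
      internal_stats.proximal_level_stats.num_output_files;
}
```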
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13464
Test Plan:
Unit Tests updated to verify stats
```
./compaction_service_test
```
```
./tiered_compaction_test
```
Reviewed By: pdillinger
Differential Revision: D71220393
Pulled By: jaykorean
fbshipit-source-id: ad70bffd9614ced683f90c7570a17def9b5c8f3f
Summary:
This PR adds a check for an invariant of the sequence number during recovery: it should not be set backward. This is inspired by a recent SEV that was caused by a software bug. It is a relatively cheap and straightforward check that RocksDB can do to avoid silently opening the DB in a corrupted state.
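A minimal sketch of such an invariant check, with assumed variable names (the real check lives inside the recovery code path):
```
#include <cstdint>

#include "rocksdb/status.h"

// Returns Corruption if the sequence number would move backward while
// recovering, instead of silently opening the DB in a corrupted state.
rocksdb::Status CheckSeqnoNotMovingBackward(uint64_t last_recovered_seqno,
                                            uint64_t new_seqno) {
  if (new_seqno < last_recovered_seqno) {
    return rocksdb::Status::Corruption(
        "Sequence number set backward during recovery");
  }
  return rocksdb::Status::OK();
}
```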
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13465
Test Plan:
Existing tests should cover the case when the invariant is met
The corrupted state is manually tested using aforementioned bug.
Reviewed By: hx235
Differential Revision: D71226513
Pulled By: jowlyzhang
fbshipit-source-id: cd8056fa6653d44ceeb9ba9b4693ab0660a53b4e
Summary:
**Context/Summary:**
Users who are interested in knowing how efficient their compactions are at reducing L0 files, or how badly their long-running compactions are "locking" L0 files, now have a reference point, "L0 files in the CF pre compaction", for their compaction input files.
- Compared to the existing stats or exposing it in some other way, exposing this info in CompactionJobInfo allows users to compare it with other compaction data (e.g., compaction input file count, compaction reason) within **one** compaction (i.e., at per-compaction granularity).
- If this number is high while a "short-running" compaction has few L0 input files, then that compaction may have room for improvement; similarly for long-running compactions. This PR adds a new field `CompactionJobInfo::num_l0_files_pre_compaction` for that.
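A hedged sketch of consuming the new field from an `EventListener`; aside from `num_l0_files_pre_compaction`, the `CompactionJobInfo` members used here (`input_files`) come from the existing public listener API:
```
#include <iostream>

#include "rocksdb/listener.h"

// Compares L0 pressure in the CF before the compaction started with the
// number of input files this particular compaction actually consumed.
class L0PressureListener : public rocksdb::EventListener {
 public:
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::cout << "CF L0 files pre compaction: "
              << info.num_l0_files_pre_compaction
              << ", compaction input files: " << info.input_files.size()
              << std::endl;
  }
};
```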
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13462
Test Plan: - Piggyback on an existing test
Reviewed By: jaykorean
Differential Revision: D71124938
Pulled By: hx235
fbshipit-source-id: aa47c9c86c62d9425771b320f5636e50671fd289
Summary:
The original implementation of NVMe write lifetime hints (https://github.com/facebook/rocksdb/pull/3095) assumed a flexible interface which decouples file creation from the explicit act of setting the write lifetime hint (see `PosixWritableFile` for more context). However, there are existing file system implementations (ex. Warm Storage) that require all the options (including file write lifetime hints) to be specified once at the time of the actual `FSWritableFile` object instantiation. We're extending the `FileOptions` with `Env::WriteLifeTimeHint` and patching existing callsites accordingly to enable one-shot metadata setup for those more constrained implementations.
NOTE: Today `CalculateSSTWriteHint` only sets write lifetime hint for Level compactions. We'll fill that gap in following PRs and add calculation for Universal Compactions which would unblock Zippy's use case.
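A hedged sketch of what the one-shot setup described above looks like from the `FileSystem` API side; the exact name of the new `FileOptions` member is an assumption here (shown as `write_hint`):
```
#include <memory>
#include <string>

#include "rocksdb/file_system.h"

// The lifetime hint travels inside FileOptions, so constrained file systems
// see it at FSWritableFile creation time rather than via a later setter.
rocksdb::IOStatus OpenWithHint(
    const std::shared_ptr<rocksdb::FileSystem>& fs, const std::string& fname,
    std::unique_ptr<rocksdb::FSWritableFile>* file) {
  rocksdb::FileOptions file_opts;
  file_opts.write_hint = rocksdb::Env::WLTH_LONG;  // assumed member name
  rocksdb::IODebugContext dbg;
  return fs->NewWritableFile(fname, file_opts, file, &dbg);
}
```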
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13461
Reviewed By: anand1976
Differential Revision: D71144645
Pulled By: mszeszko-meta
fbshipit-source-id: 6c09b62a360d48bd6e4fb08a1265bce2a49f3f4a
Summary:
In hopes of eventually removing some ugly and awkward code for compress_format_version < 2, users can no longer write files in that format, and its read support is marked deprecated. For continuing to test that read support, there is a back door to writing such files in unit tests.
If format_version < 2 is specified, it is quietly sanitized to 2. (This is similar to other BlockBasedTableOptions.)
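An illustrative sketch of the sanitization behavior from the user's perspective (not the internal sanitizer itself):
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  // Requesting a format_version below 2 is now quietly sanitized to 2.
  bbto.format_version = 1;
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}
```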
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13463
Test Plan: unit tests updated.
Reviewed By: hx235
Differential Revision: D71152916
Pulled By: pdillinger
fbshipit-source-id: 95be55e86f93f09fd898223578b9381385c3ccd8
Summary:
With generalized age-based tiering (work-in-progress), the "warm tier" data will no longer necessarily be placed in the second-to-last level (also known as the "penultimate level").
Also, the cold tier may no longer necessarily be at the last level, so we need to rename options like `preclude_last_level_seconds` to `preclude_cold_tier_seconds`, but renaming options is trickier because it can be a breaking change for consuming applications. We will do this later as a follow up.
**Minor fix included**: Fixed one `use-after-move` in CompactionPicker
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13460
Test Plan: CI
Reviewed By: pdillinger
Differential Revision: D71059486
Pulled By: jaykorean
fbshipit-source-id: fd360cdf719e015bf9f9e3f6f1663438226566a4
Summary:
This PR adds support for PerKeyPlacement in Remote Compaction.
The `seqno_to_time_mapping` is already available from the table properties of the input files. `preserve_internal_time_seconds` and `preclude_last_level_data_seconds` are directly read from the OPTIONS file upon db open in the remote worker. The necessary changes include:
- Add `is_penultimate_level_output` and `file_temperature` to the `CompactionServiceOutputFile`
- When building the output for the remote compaction, get the outputs for the penultimate level and the last level separately, and serialize them with the two additional fields added in this PR (see the simplified sketch after this list).
- When deserializing the result on the primary, SubcompactionState's `GetOutputs()` now takes `is_penultimate_level`. This allows us to determine in which level to place the output file.
- Include stats from `compaction_stats.penultimate_level_stats` in the remote compaction result
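A simplified sketch of the serialized per-output-file record, highlighting only the two fields added by this PR (the remaining fields and the real `Temperature` type are elided):
```
#include <string>

struct CompactionServiceOutputFileSketch {
  std::string file_name;
  // ... existing fields elided ...
  bool is_penultimate_level_output = false;  // added: which output level group
  int file_temperature = 0;                  // added: stands in for Temperature
};
```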
# To Follow up
- Stats to be fixed. Stats are not being populated correctly for PerKeyPlacement even for non-remote compactions.
- Clean up / Reconcile the "penultimate" naming by replacing with "proximal"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13459
Test Plan:
Updated the unit test
```
./compaction_service_test
```
Reviewed By: pdillinger
Differential Revision: D71007211
Pulled By: jaykorean
fbshipit-source-id: f926e56df17239875d849d46b8b940f8cd5f1825
Summary:
[Experiment]
This PR is a follow-up to https://github.com/facebook/rocksdb/pull/13408. The thick bandaid of ignoring all injected read errors in the context of periodic iterator auto refreshes in db stress proved to be effective. We confirmed our theory that the errors are not really a consequence of / defect in this new feature but rather due to subtle ways in which downstream code paths handle their respective IO failures. In this change we're replacing the thick 'ignore all IO read errors' bandaid in `no_batched_ops_stress` with much smaller, targeted patches in the obsolete files purge / delete codepaths, the table block cache reader, and the table cache lookup to make sure we don't miss signal and to ensure there's a single mechanism for ignoring error injection in db stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13447
Reviewed By: hx235
Differential Revision: D70794787
Pulled By: mszeszko-meta
fbshipit-source-id: c5fcd4780d82357c407f53bf0bb22fc38f7bd277
Summary:
Add debug logging when the Wait() does not return `kSuccess` so that we can compare the version state that was printed by the logging added in https://github.com/facebook/rocksdb/issues/13427 upon InputFileCheck failure.
# Test Plan
CI + Tested with Temporary Change in Meta Internal Infra
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13452
Reviewed By: hx235
Differential Revision: D70898963
Pulled By: jaykorean
fbshipit-source-id: d591b82f2df173b5e01f6552230844ce95155256
Summary:
`nullptr` is preferable to `0` or `NULL`. Let's use it everywhere so we can enable `-Wzero-as-null-pointer-constant`.
Reviewed By: dtolnay
Differential Revision: D70818166
fbshipit-source-id: 4658fb004676fe2686249fdd8ecb322dec8aa63d
Summary:
Primarily, fix an issue from https://github.com/facebook/rocksdb/issues/13316 with opening secondary DB with preserve/preclude option (crash test disable in https://github.com/facebook/rocksdb/issues/13439). The issue comes down to mixed-up interpretations of "read_only" which should now be resolved. I've introduced the stronger notion of "unchanging" which means the VersionSet never sees any changes to the LSM tree, and the weaker notion of "read_only" which means LSM tree changes are not written through this VersionSet/etc. but can pick up externally written changes. In particular, ManifestTailer should use read_only=true (along with unchanging=false) for proper handling of preserve/preclude options.
A new assertion in VersionSet::CreateColumnFamily to help ensure sane usage of the two boolean flags is incompatible with the known wart of allowing CreateColumnFamily on a read-only DB. So to keep that assertion, I have fixed that issue by disallowing it. And this in turn required downstream clean-up in ldb, where I cleaned up some call sites as well.
Also, rename SanitizeOptions for ColumnFamilyOptions to SanitizeCfOptions, for ease of search etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13441
Test Plan:
* Added preserve option to a test in db_secondary_test, which reproduced the failure seen in the crash test.
* Revert https://github.com/facebook/rocksdb/issues/13439 to re-enable crash test functionality
* Update some tests to deal with disallowing CF creation on read-only DB
* Add some testing around read-only DBs and CreateColumnFamily(ies)
* Resurrect a nearby test for read-only DB to be sure it doesn't write to the DB dir. New EnforcedReadOnlyReopen should probably be used in more places but didn't want to attempt a big migration here and now. (Suggested follow-up.)
Reviewed By: jowlyzhang
Differential Revision: D70808033
Pulled By: pdillinger
fbshipit-source-id: 486b4e9f9c9045150a0ebb9cb302753d03932a3f
Summary:
In case the primary host has a new option added which isn't available in the remote worker yet, the remote compaction currently fails. In most cases, these new options are not relevant to the remote compaction and the worker should be able to move on by ignoring them.
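This is the same tolerance the public options-loading API offers via `ConfigOptions::ignore_unknown_options`; the sketch below only illustrates that concept and is not the remote-compaction code path itself:
```
#include <string>
#include <vector>

#include "rocksdb/utilities/options_util.h"

// Load an OPTIONS file while skipping option names this build doesn't know,
// rather than failing the whole load.
rocksdb::Status LoadOptionsTolerantly(
    const std::string& options_file, rocksdb::Env* env,
    rocksdb::DBOptions* db_opts,
    std::vector<rocksdb::ColumnFamilyDescriptor>* cf_descs) {
  rocksdb::ConfigOptions config;
  config.env = env;
  config.ignore_unknown_options = true;
  return rocksdb::LoadOptionsFromFile(config, options_file, db_opts, cf_descs);
}
```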
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13443
Test Plan: Verified internally in Meta Infra.
Reviewed By: anand1976
Differential Revision: D70744359
Pulled By: jaykorean
fbshipit-source-id: eb6a388c2358a7f8089f2e35a378b7017b9e03f3
Summary:
If a compaction job needs to be aborted inside `Schedule()` or `Wait()` today (e.g. the primary host is shutting down), the only two options are the following:
- Handle it as failure by returning `CompactionServiceJobStatus::kFailure`
- Return `CompactionServiceJobStatus::kUseLocal` and let the compaction move on locally and eventually succeed or fail depending on the timing
In this PR, we are introducing a new status, `CompactionServiceJobStatus::kAborted`, so that the implementation of `Schedule()` and `Wait()` can return it. Just like how `CompactionServiceJobStatus::kFailure` is handled, the compaction will not move on and will fail, but the status will be returned as `Status::Aborted()` instead of `Status::Incomplete()`.
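A minimal sketch of the decision an implementation might make, with simplified/assumed inputs (the real `CompactionService::Wait()` takes job information and a result string):
```
#include "rocksdb/options.h"

// New in this PR: kAborted surfaces on the primary as Status::Aborted()
// instead of Status::Incomplete(), and the compaction does not fall back
// to running locally.
rocksdb::CompactionServiceJobStatus WaitSketch(bool primary_shutting_down,
                                               bool remote_job_failed) {
  if (primary_shutting_down) {
    return rocksdb::CompactionServiceJobStatus::kAborted;
  }
  if (remote_job_failed) {
    return rocksdb::CompactionServiceJobStatus::kFailure;
  }
  return rocksdb::CompactionServiceJobStatus::kSuccess;
}
```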
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13438
Test Plan:
Unit Test added
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.AbortedWhileWait*"
```
Reviewed By: anand1976, hx235
Differential Revision: D70655355
Pulled By: jaykorean
fbshipit-source-id: 22614ce9c7455cda649b15465625edc93978fe11
Summary:
I have a place I want to use this helper method inside the Sally codebase. I have this functionality in my Sally diff right now, but I think it is generic enough to warrant putting alongside `Env::PriorityToString`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13440
Test Plan: Just the compiler and CI checks are sufficient IMO.
Reviewed By: hx235
Differential Revision: D70664597
Pulled By: archang19
fbshipit-source-id: 341de6c6e311a3f421ad093c2c216e5caa5034dd
Summary:
This PR adds the ability to use an ExternalTableBuilder through the SstFileWriter to create external tables. This is a counterpart to https://github.com/facebook/rocksdb/issues/13401 , which adds the ExternalTableReader. The support for external tables is confined to ingestion only DBs, with external table files ingested into the bottommost level only. https://github.com/facebook/rocksdb/issues/13431 enforces ingestion only DBs by adding a disallow_memtable_writes column family option.
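A hedged illustration using the existing public `SstFileWriter` flow; per this PR, an `ExternalTableBuilder` would be picked up through the table factory configured in `Options` (the exact plumbing is internal to the change):
```
#include <string>

#include "rocksdb/sst_file_writer.h"

// Write a standalone file for later ingestion. With an external table
// factory configured in `options`, the file would be produced by the
// plugged-in ExternalTableBuilder instead of the block-based builder.
rocksdb::Status WriteExternalFile(const rocksdb::Options& options,
                                  const std::string& file_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(file_path);
  if (!s.ok()) {
    return s;
  }
  // External table files are expected to contain only Puts.
  s = writer.Put("key1", "value1");
  if (!s.ok()) {
    return s;
  }
  return writer.Finish();
}
```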
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13434
Test Plan: New unit tests in table_test.cc
Reviewed By: pdillinger
Differential Revision: D70532054
Pulled By: anand1976
fbshipit-source-id: a837487eadfabed9627a0eceb403bfc5fc2c427c
Summary:
Add an unordered_map of name/value pairs in ReadOptions::property_bag, similar to IOOptions::property_bag. It allows users to pass through some custom options to an external table.
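A small sketch of the intended usage; this assumes `ReadOptions::property_bag` is a plain `std::unordered_map<std::string, std::string>`, mirroring `IOOptions::property_bag` as stated above, and the key name is purely hypothetical:
```
#include "rocksdb/options.h"

rocksdb::ReadOptions MakeReadOptionsWithReaderHints() {
  rocksdb::ReadOptions ro;
  // Passed through to the external table reader; ignored by built-in tables.
  ro.property_bag["my_external_reader.block_size"] = "65536";  // hypothetical key
  return ro;
}
```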
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13436
Reviewed By: jaykorean
Differential Revision: D70649609
Pulled By: anand1976
fbshipit-source-id: 9b14806a9f3599b861827bd4ae6e948861edc51a
Summary:
PR https://github.com/facebook/rocksdb/issues/13316 broke some crash test cases in DBImplSecondary, from combining test_secondary=1 and preserve_internal_time_seconds>0. Disabling that while investigating the fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13439
Test Plan: manual blackbox_crash_test runs with forced test_secondary=1
Reviewed By: anand1976
Differential Revision: D70656373
Pulled By: pdillinger
fbshipit-source-id: fa2139e90bbe64ec8ebb062877d9337894ea3b43
Summary:
... to better support "ingestion only" column families such as those using an external file reader as in https://github.com/facebook/rocksdb/issues/13401.
It would be possible to implement this by getting rid of the memtable for that CF, but it quickly became clear that such an approach would need to update a lot of places to deal with such a possibility. And we already have logic to optimize reads when a memtable is empty. We put a vector memtable in place to minimize the overhead of an empty memtable.
There are three layers of defense against writes to the memtable (a brief usage sketch follows this list):
* WriteBatch ops to a disallowed CF will fail immediately, without waiting for Write(). For this check to work, we need a ColumnFamilyHandle and because of that, we don't support disallow_memtable_writes on the default column family.
* MemtableInserter will reject writes to disallowed CFs. This is needed to protect re-open with disallow when there are existing writes in a WAL.
* The placeholder memtable is marked immutable. This will cause an assertion failure on attempt to write, such as in case of bug or regression.
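A hedged sketch of the user-visible behavior when the option is set; only `disallow_memtable_writes` comes from this PR, the rest is standard API:
```
#include <cassert>

#include "rocksdb/db.h"

// ingest_only_cf was created with ColumnFamilyOptions where
// disallow_memtable_writes = true (not the default CF; see above).
void ExpectWriteRejected(rocksdb::DB* db,
                         rocksdb::ColumnFamilyHandle* ingest_only_cf) {
  rocksdb::Status s =
      db->Put(rocksdb::WriteOptions(), ingest_only_cf, "k", "v");
  // Rejected up front by the first layer of defense, before Write() runs.
  assert(!s.ok());
}
```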
Suggested follow-up:
* Remove the limitation on using the option with the default column family, perhaps by solving https://github.com/facebook/rocksdb/issues/13429 more generally or perhaps with some specific check before the first memtable write of the batch (but potential CPU overhead for such a check - there's likely optimization opportunities around ColumnFamilyMemTables).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13431
Test Plan:
unit tests added
Performance: A db_bench call designed to realistically focus on the CPU cost of writes:
```
./db_bench -db=/dev/shm/dbbench1 --benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -num_column_families=20 -disable_wal -write_buffer_size=1234000
```
Running before & after tests at the same time on the same machine, 40 iterations each, average ops/s, DEBUG_LEVEL=0, remove slowest run of each:
Before: 772466
After: 773785 (0.2% faster)
Likely within the noise, as if there was any change, we would expect a slight regression.
Reviewed By: anand1976
Differential Revision: D70495936
Pulled By: pdillinger
fbshipit-source-id: 306f7e737f87c1fbb52c5805f3cadb6e8ced9b40
Summary:
This is an unexpectedly complex follow-up to https://github.com/facebook/rocksdb/issues/13269.
This change solves (and adds detection for regressions of) inconsistencies between whether a CF's SuperVersion is configured with a preserve/preclude option and whether it gets a usable SeqnoToTimeMapping. Operating with preserve/preclude and no usable mapping is degraded functionality we need to avoid. Meanwhile, having no mapping is useful for actually disabling the feature (except with respect to existing SST files, but that's less of a concern for now).
The challenge is that how we maintain the DB's SeqnoToTimeMapping can depend on all the column families, and we don't want to iterate over all column families *for each column family* (e.g. on initially creating each). The existing code was a bit relaxed:
* On initially creating or re-configuring a CF, we might install an empty mapping, but soon thereafter (after releasing and re-acquiring the DB mutex) re-install another SuperVersion with a useful mapping.
The solution here is to refactor the logic so that there's a distinct but related workflow for (a) ensuring a quality set of mappings when we might only be considering a single CF (`EnsureSeqnoToTimeMapping()`), and (b) massaging that set of mappings to account for all CFs (`RegisterRecordSeqnoTimeWorker`) which doesn't need to re-install new SuperVersions because each CF already has good mappings and will get updated SuperVersions when the periodic task adds new mappings. This should eliminate the extra SuperVersion installs associated with preserve/preclude on CF creation or re-configure, making it the same as any other CF.
Some more details:
* Some refactorings such as removing new_seqno_to_time_mapping from SuperVersionContext. (Now use parameter instead of being stateful.)
* Propagate `read_only` aspect of DB to more places so that we can pro-actively disable preserve/preclude on read-only DBs, so that we don't run afoul of the assertion expecting SeqnoToTime entries.
* Introduce a utility struct `MinAndMaxPreserveSeconds` for aggregating preserve/preclude settings in a useful way, sometimes on one CF and sometimes across multiple CFs. Much cleaner! (IMHO)
* Introduce a function `InstallSuperVersionForConfigChange` that is a superset of `InstallSuperVersionAndScheduleWork` for when a CF is new or might have had a change to its mutable options.
* Eliminate redundant re-install SuperVersions of created "missing" CFs in DBImpl::Open.
Intended follow-up:
* Ensure each flush has an "upper bound" SeqnoToTime entry, which would resolve a FIXME in tiered_compaction_test, but causes enough test churn to deserve its own PR + investigation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13316
Test Plan:
This change is primarily validated by a new assertion in SuperVersion::Init to ensure consistency between (a) presence of any SeqnoToTime mappings in the SuperVersion and (b) preserve/preclude option being currently set.
One unit test update was needed because we now ensure at least one SeqnoToTime entry is created on any DB::Open with preserve/preclude, so that there is a lower bound time on all the future data writes. This required a small hack in associating the time with Seqno 1 instead of 0, which is reserved for "unspecified old."
Reviewed By: cbi42
Differential Revision: D70540638
Pulled By: pdillinger
fbshipit-source-id: bb419fdbeb5a1f115fc429c211f9b8efaf2f56d7
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13435
We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.
The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.
This fix enables the flag internally and reverts the previous fix, which landed with D70423483.
Reviewed By: pdillinger
Differential Revision: D70584250
fbshipit-source-id: 28e41316187c474fdfaf854f301ad14b6721fcad
Summary:
When reading with ReadOptions::read_tier = kPersistedTier and with a snapshot, MultiGet allows the case where some CF is read before a flush and some CF is read after the flush. This is not desirable, especially when atomic_flush is enabled and users use MultiGet to do consistency checks on the data in SST files. This PR updates the code path for SuperVersion acquisition to get a consistent view across column families when kPersistedTier is used.
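For reference, a sketch of the read pattern in question using the public API; after this fix, the reads below observe a consistent (pre- or post-flush) view across the CFs:
```
#include <string>
#include <vector>

#include "rocksdb/db.h"

std::vector<rocksdb::Status> PersistedTierMultiGet(
    rocksdb::DB* db, const std::vector<rocksdb::ColumnFamilyHandle*>& cfs,
    const std::vector<rocksdb::Slice>& keys) {
  rocksdb::ReadOptions ro;
  ro.read_tier = rocksdb::kPersistedTier;  // read only persisted (SST) data
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  ro.snapshot = snap;
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses = db->MultiGet(ro, cfs, keys, &values);
  db->ReleaseSnapshot(snap);
  return statuses;
}
```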
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13433
Test Plan: a new unit test that could be flaky without this change.
Reviewed By: jaykorean
Differential Revision: D70509688
Pulled By: cbi42
fbshipit-source-id: 80de96f94407af9bb2062b6a185c61f65827c092
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13432
We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.
The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.
This fix allows the usage of hardware-accelerated CRC32 within our fleet.
Reviewed By: jaykorean
Differential Revision: D70423483
fbshipit-source-id: 601da3fbf156e3e40695eb76ee5d37f67f83d427
Summary:
This adds a test that attempts DeleteRange() with PlainTable (not supported) and shows that it not only puts the DB in failed write mode, but also (a) breaks WriteBatch atomicity for readers, because they can see just part of a failed WriteBatch, and (b) makes the DB not recoverable (without manual intervention) if using the WAL.
Note: WriteBatch atomicity is not clearly documented but indicated at the top of write_batch.h and the wiki page for Transactions, even without Transactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13428
Test Plan: this is the test
Reviewed By: anand1976
Differential Revision: D70332226
Pulled By: pdillinger
fbshipit-source-id: 67bc4de68833a80578e48baa9d3a4f23f1600f3c
Summary:
The existing format compatibility test had limited coverage of compression options, particularly newer algorithms with and without dictionary compression. There are some subtleties that need to remain consistent, such as index blocks potentially being compressed but *not* using the file's dictionary if they are. This involves detecting (with a rough approximation) builds with the appropriate capabilities.
The other motivation for this change is testing some potentially useful reader-side functionality that has been in place for a long time but has not been exercised until now: mixing compressions in a single SST file. The block-based SST schema puts a compression marker on each block; arguably this is for distinguishing blocks compressed using the algorithm stored in compression_name table property from blocks left uncompressed, e.g. because they did not reach the threshold of useful compression ratio, but the marker can also distinguish compression algorithms / decompressors.
As we work toward customizable compression, it seems worth unlocking the capability to leverage the existing schema and SST reader-side support for mixing compression algorithms among the blocks of a file. Yes, a custom compression could implement its own dynamic algorithm chooser with its own tag on the compressed data (e.g. first byte), but that is slightly less storage efficient and doesn't support "vanilla" RocksDB builds reading files using a mix of built-in algorithms. As a hypothetical example, we might want to switch to lz4 on a machine that is under heavy CPU load and back to zstd when load is more normal. I dug up some data indicating ~30 seconds per output file in compaction, suggesting that file-level responsiveness might be too slow. This agility is perhaps more useful with disaggregated storage, where there is more flexibility in DB storage footprint and potentially more payoff in optimizing the *average* footprint.
In support of this direction, I have added a backdoor capability for debug builds of `ldb` to generate files with a mix of compression algorithms and incorporated this into the format compatibility test. All of the existing "forward compatible" versions (currently back to 8.6) are able to read the files generated with "mixed" compression. (NOTE: there's no easy way to patch a bunch of old versions to have them support generating mixed compression files, but going forward we can auto-detect builds with this "mixed" capability.) A subtle aspect of this support is that for proper handling of decompression contexts and digested dictionaries, we need to set the `compression_name` table property to `zstd` if any blocks are zstd compressed. I'm expecting to add better info to SST files in follow-up, but this approach here gives us forward compatibility back to 8.6.
However, in the spirit of opening things up with what makes sense under the existing schema, we only support one compression dictionary per file. It will be used by any/all algorithms that support dictionary compression. This is not outrageous because it seems standard that a dictionary is *or can be* arbitrary data representative of what will be compressed. This means we would need a schema change to add dictionary compression support to an existing built-in compression algorithm (because otherwise old versions and new versions would disagree on whether the data dictionary is needed with that algorithm; this could take the form of a new built-in compression type, e.g. `kSnappyCompressionWithDict`; only snappy, bzip2, and windows-only xpress compression lack dictionary support currently).
Looking ahead to supporting custom compression, exposing a sizeable set of CompressionTypes to the user for custom handling essentially guarantees a path for the user to put *versioning* on their compression even if they neglect that initially, and without resorting to managing a bunch of distinct named entities. (I'm envisioning perhaps 64 or 127 CompressionTypes open to customization, enough for ~weekly new releases with more than a year of horizon on recycling.)
More details:
* Reduce the running time (CI cost) of the default format compatibility test by randomly sampling versions that aren't the oldest in a category. AFAIK, pretty much all regressions can be caught with the even more stripped-down SHORT_TEST.
* Configurable make parallelism with J environment variable
* Generate data files in a way that makes them much more eligible for index compression, e.g. bigger keys with less entropy
* Generate enough data files
* Remove 2.7.fb.branch from list because it shows an assertion violation when involving compression.
* Randomly choose a contiguous subset of the compression algorithms X {dictionary, no dictionary} configuration space when generating files, with a number of files > number of algorithms. This covers all the algorithms and both dictionary/no dictionary for each release (but not in all combinations).
* Have `ldb` fail if the specified compression type is not supported by the build.
Other future work needed:
* Blob files in format compatibility test, and support for mixed compression. NOTE: the blob file schema should naturally support mixing compression algorithms but the reader code does not because of an assertion that the block CompressionType (if not no compression) matches the whole file CompressionType. We might introduce a "various" CompressionType for this whole file marker in blob files.
* Do more to ensure certain features and code paths e.g. in the scripts are actually used in the compatibility test, so that they aren't accidentally neutralized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13414
Test Plan: Manual runs with some temporary instrumentation, also a recent revision of this change included a GitHub Actions run of the updated format compatible test: https://github.com/facebook/rocksdb/actions/runs/13463551149/job/37624205915?pr=13414
Reviewed By: hx235
Differential Revision: D70012056
Pulled By: pdillinger
fbshipit-source-id: 9ea5db76ba01a95338ed1a86b0edd71a469c4061
Summary:
Added merge support for WBWIMemTable. Most of the preparation work was done in https://github.com/facebook/rocksdb/issues/13387 and https://github.com/facebook/rocksdb/issues/13400. The main code change to support merge is in wbwi_memtable.cc to support reading the Merge value type. The rest of the changes are mostly comment changes and tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13410
Test Plan:
- new unit test
- ran `python3 ./tools/db_crashtest.py --txn blackbox --txn_write_policy=0 --commit_bypass_memtable_one_in=100 --test_batches_snapshots=0 --use_merge=1` for several runs.
Reviewed By: jowlyzhang
Differential Revision: D69885868
Pulled By: cbi42
fbshipit-source-id: b127d95a3027dc35910f6e5d65f3409ba27e2b6b
Summary:
... to ensure proper cache charging. However, this is a somewhat hazardous combination if there are many CFs and could be the target of future work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13398
Test Plan: this is the test
Reviewed By: hx235
Differential Revision: D69619977
Pulled By: pdillinger
fbshipit-source-id: 9841768584e4688d8fdd0258f3ba9608b67408e5
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13411
We should initialize statuses with OK rather than IOError to correctly handle cases like NotFound due to the bloom filter. In case of IOError, the status would be updated appropriately by the reader.
Reviewed By: anand1976
Differential Revision: D69886976
fbshipit-source-id: 92b130168f23633224ff4153bfe46a7d86482b90
Summary:
This is a preparation for supporting merge in `WBWIMemTable`. This PR updates the sequence number assignment method so that it allows efficient and simple assignment when there are multiple entries with the same user key. This can happen when the WBWI contains Merge operations. This assignment relies on tracking the number of updates issued for each key in each WBWI entry (`WriteBatchIndexEntry::update_count`). Some refactoring is done in WBWI to remove `last_entry_offset` as part of the WBWI state, which I found harder to use correctly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13400
Test Plan: updated unit tests to check that update count is tracked correctly and WBWIMemTable is assigning sequence number as expected.
Reviewed By: pdillinger
Differential Revision: D69666462
Pulled By: cbi42
fbshipit-source-id: 9b18291825017a67c4da3318e8a556aa2971326b
Summary:
This PR introduces an interface to plug in an external table file reader into RocksDB. The external table reader may support custom file formats that might work better for a specific use case compared to RocksDB native formats. This initial version allows the external table file to be loaded and queried using an `SstFileReader`. In the near future, we will allow it to be used with a limited RocksDB instance that allows bulkload but not live writes.
The model of a DB using an external table reader is a read only database allowing bulkload and atomic replace in the bottommost level only. Live writes, if supported in the future, are expected to use block based table files in higher levels. Tombstones, merge operands, and non-zero sequence numbers are expected to be present only in non-bottommost levels. External table files are assumed to have only Puts, and all keys implicitly have sequence number 0.
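An illustration using the public `SstFileReader` API mentioned above; the external table reader itself is plugged in through the table factory configured in `Options` (interface details are in the PR):
```
#include <memory>
#include <string>

#include "rocksdb/sst_file_reader.h"

rocksdb::Status ScanFile(const rocksdb::Options& options_with_external_factory,
                         const std::string& file_path) {
  rocksdb::SstFileReader reader(options_with_external_factory);
  rocksdb::Status s = reader.Open(file_path);
  if (!s.ok()) {
    return s;
  }
  std::unique_ptr<rocksdb::Iterator> it(
      reader.NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // All keys implicitly carry sequence number 0 (Puts only).
  }
  return it->status();
}
```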
TODO (in future PRs) -
1. Add support for external file ingestion, with safety mechanisms to prevent accidental writes
2. Add support for atomic column family replace
3. Allow custom table file extensions
4. Add a TableBuilder interface for use with `SstFileWriter`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13401
Reviewed By: pdillinger
Differential Revision: D69689351
Pulled By: anand1976
fbshipit-source-id: c5d5b92d56fd4d0fc43a77c4ceb0463d4f479bda
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13403
Add MultiGet support to SstFileReader. Today we only have iteration support; this change adds MultiGet support in case an application wants to use it.
Reviewed By: anand1976
Differential Revision: D69514499
fbshipit-source-id: 20e85a4bd13a3a9f45dacb223c1a4541fb87f561
Summary:
Noticed that the `do_merge` parameter is not properly set while working on memtable code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13396
Test Plan: updated unit test for the read-only db case.
Reviewed By: jaykorean
Differential Revision: D69505015
Pulled By: cbi42
fbshipit-source-id: d4c64ca7bba31fe26aa41a29cbc55835d9f1f116
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/13263 and https://github.com/facebook/rocksdb/pull/13360 temporarily disabled `track_and_verify_wals` with some injection under TXN, but recent stress tests have found more issues surfaced by this feature even with the previous disabling. Disabling the feature **completely** now to stabilize CI while debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13405
Test Plan: Monitor CI
Reviewed By: cbi42
Differential Revision: D69759276
Pulled By: hx235
fbshipit-source-id: 501a3561acb9daa834f874095f9a66ae6ae5aa42
Summary:
**Context/Summary:**
It's [documented (affcad0cc9/db/job_context.h (L230)) that `// For non-empty JobContext Clean() has to be called at least once before destruction`. This was accidentally violated in a UT, causing the assertion failure `assert(logs_to_free.size() == 0);` in `~JobContext`. This PR is to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13406
Test Plan: Monitor for future UT assertion failure in `TEST_F(DBWALTest, FullPurgePreservesRecycledLog) `
Reviewed By: cbi42
Differential Revision: D69759725
Pulled By: hx235
fbshipit-source-id: dd1617b370a2c69daba657287dcf258542f92ef5
Summary:
**Context/Summary:**
02b4197544 recently added the ability to detect a WAL hole present in the predecessor WAL. It forgot to update the corrupted WAL number to point to the predecessor WAL in that corruption case. This PR fixes it.
As a bonus, this PR also (1) fixed the `FragmentBufferedReader()` constructor API to expose fewer parameters, as they are never explicitly passed in the codebase, (2) an INFO log wording, (3) a parameter naming typo, and (4) the reporter naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13359
Test Plan:
1. Manual printing to ensure the corrupted wal number is set to the right number
2. Existing UTs
Reviewed By: jowlyzhang
Differential Revision: D69068089
Pulled By: hx235
fbshipit-source-id: f7f8a887cded2d3a26cf9982f5d1d1ab6a78e9e1
Summary:
**Context/Summary:**
Secondary DB relies on the open file descriptor of the shared SST file in the primary DB to continue being able to read the file even if that file is deleted in the primary DB. However, this won't work if the file is truncated instead of deleted, which triggers a "truncated block read" corruption in the stress test on secondary DB reads. Truncation can happen if RocksDB's implementation of SstFileManager and `bytes_max_delete_chunk>0` are used. This PR disables such a testing combination in the stress test and clarifies the related API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13395
Test Plan:
- Manually repro-ed with the UT below. I'm in favor of not including this UT in the codebase as it should be self-evident from the API comment now about the incompatibility. Secondary DB is in the direction of being replaced by Follower, so we should minimize edge-case tests for code with no functional change for a to-be-replaced functionality.
```
TEST_F(DBSecondaryTest, IncompatibleWithPrimarySSTTruncation) {
  Options options;
  options.env = env_;
  options.disable_auto_compactions = true;
  // SstFileManager configured with a small bytes_max_delete_chunk so that
  // obsolete SST files get truncated in chunks rather than unlinked at once.
  options.sst_file_manager.reset(NewSstFileManager(
      env_, nullptr /*fs*/, "" /*trash_dir*/, 2024000 /*rate_bytes_per_sec*/,
      true /*delete_existing_trash*/, nullptr /*status*/,
      0.25 /*max_trash_db_ratio*/, 1129 /*bytes_max_delete_chunk*/));
  Reopen(options);
  ASSERT_OK(Put("key1", "old_value"));
  ASSERT_OK(Put("key2", "old_value"));
  ASSERT_OK(Flush());
  ASSERT_OK(Put("key1", "new_value"));
  ASSERT_OK(Put("key3", "new_value"));
  ASSERT_OK(Flush());
  Options options1;
  options1.env = env_;
  options1.max_open_files = -1;
  Reopen(options);
  OpenSecondary(options1);
  ASSERT_OK(db_secondary_->TryCatchUpWithPrimary());
  // While the primary truncates the now-obsolete SST file (after the
  // compaction below), a read through the secondary's open file descriptor
  // hits a "truncated block read" corruption.
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "DeleteScheduler::DeleteTrashFile:Fsync", [&](void*) {
        std::string value;
        Status s = db_secondary_->Get(ReadOptions(), "key2", &value);
        assert(s.IsCorruption());
        assert(s.ToString().find("truncated block read") != std::string::npos);
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
```
- Monitor future stress test
Reviewed By: jowlyzhang
Differential Revision: D69499694
Pulled By: hx235
fbshipit-source-id: 57525b9841897f42aecb758a4d3dd3589367dcd9
Summary:
# Problem
Once opened, an iterator will preserve its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh'ed`, the iterator will hold on to the `Init`-time assigned `SuperVersion` throughout its lifetime. As time goes by, this might result in artificially long holdup of obsolete memtables (_potentially_ referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved to be especially problematic in the case of _logical_ backups (outside of RocksDB `BackupEngine`).
# Solution
Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in https://github.com/facebook/rocksdb/pull/10594, we're adding a new `ReadOptions` opt-in knob that (when enabled) will instruct the iterator to automatically refresh itself to the latest superversion - all that while retaining the originally assigned, explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To ensure minimal performance overhead we're leveraging relaxed atomic for superversion freshness lookups.
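A hedged sketch of the opt-in from the API side; the `ReadOptions` field name used here (matching the db_bench flag mentioned below) is an assumption, and snapshot release is elided:
```
#include <memory>

#include "rocksdb/db.h"

std::unique_ptr<rocksdb::Iterator> NewAutoRefreshingIterator(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  // Explicit snapshot: reads stay consistent even as the iterator refreshes
  // itself to newer SuperVersions.
  ro.snapshot = db->GetSnapshot();
  ro.auto_refresh_iterator_with_snapshot = true;  // assumed field name
  return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro));
}
```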
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13354
Test Plan:
**Correctness:** New test to demonstrate the auto refresh behavior in contrast to legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.
**Stress testing:** We're adding command line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can with random feature on/off configuration in db_crashtest.py.
# Benchmarking
The goal of this benchmark is to validate that throughput did not regress substantially. Benchmark was run on optimized build, 3-5 times for each respective category or till convergence. In addition, we configured aggressive threshold of 1 second for new `Superversion` creation. Experiments have been run 'in parallel' (at the same time) on separate db instances within a single host to evenly spread the potential adverse impact of noisy neighbor activities. Host specs [1].
**TLDR;** Baseline & new solution are practically indistinguishable from performance standpoint. Difference (positive or negative) in throughput relative to the baseline, if any, is no more than 1-2%.
**Snapshot initialization approach:**
This feature is only effective on iterators with well-defined `snapshot` passed via `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive. Especially in case of 'tiny scans' (100 rows) contributing as much as 25-35 microseconds, which is ~20-30% of the average per/op latency unintentionally masking _potentially_ adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all the future scans _before_ running multiple experiments en masse. This is also a valuable data point for us to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower bound cost could be.
## "DB in memory" benchmark
**DB Setup**
1. Allow a single memtable to grow large enough (~572MB) to fit all the rows. Upon shutdown all the rows will be flushed to the WAL file (the inspected `000004.log` file is 541MB in size).
```
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000 -max_write_buffer_number=2 -compression_type=none
```
2. As a part of recovery in subsequent DB open, WAL will be processed to one or more SST files during the recovery. We're selecting a large block cache (`cache_size` parameter in `db_bench` script) suitable for holding the entire DB to test the “hot path” CPU overhead.
```
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=1 -async_io=1 -num=? -seek_nexts=? -cache_size=? -write_buffer_size=? -auto_refresh_iterator_with_snapshot={0|1}
```
| | seek_nexts=100; num=2,000,000 | seek_nexts = 20,000; num=50000 | seek_nexts = 400,000; num=2000 |
-- | -- | -- | --
baseline | 36362 (± 300) ops/sec, 928.8 (± 23) MB/s, 99.11% block cache hit | 52.5 (± 0.5) ops/sec, 1402.05 (± 11.85) MB/s, 99.99% block cache hit | 156.2 (± 6.3) ms / op, 1330.45 (± 54) MB/s, 99.95% block cache hit
auto refresh | 35775.5 (± 537) ops/sec, 926.65 (± 13.75) MB/s, 99.11% block cache hit | 53.5 (± 0.5) ops/sec, 1367.9 (± 9.5) MB/s, 99.99% block cache hit | 162 (± 4.14) ms / op, 1281.35 (± 32.75) MB/s, 99.95% block cache hit
_-cache_size=5000000000 -write_buffer_size=3200000000 -max_write_buffer_number=2_
| | seek_nexts=3,500,000; num=100 |
-- | --
baseline | 1447.5 (± 34.5) ms / op, 1255.1 (± 30) MB/s, 98.98% block cache hit
auto refresh | 1473.5 (± 26.5) ms / op, 1232.6 (± 22.2) MB/s, 98.98% block cache hit
_-cache_size=17680000000 -write_buffer_size=14500000000 -max_write_buffer_number=2_
| | seek_nexts=17,500,000; num=10 |
-- | --
baseline | 9.11 (± 0.185) s/op, 997 (± 20) MB/s
auto refresh | 9.22 (± 0.1) s/op, 984 (± 11.4) MB/s
[1]
### Specs
| Property | Value
-- | --
RocksDB | version 10.0.0
Date | Mon Feb 3 23:21:03 2025
CPU | 32 * Intel Xeon Processor (Skylake)
CPUCache | 16384 KB
Keys | 16 bytes each (+ 0 bytes user-defined timestamp)
Values | 100 bytes each (50 bytes after compression)
Prefix | 0 bytes
RawSize | 5.5 MB (estimated)
FileSize | 3.1 MB (estimated)
Compression | Snappy
Compression sampling rate | 0
Memtablerep | SkipListFactory
Perf Level | 1
Reviewed By: pdillinger
Differential Revision: D69122091
Pulled By: mszeszko-meta
fbshipit-source-id: 147ef7c4fe9507b6fb77f6de03415bf3bec337a8
Summary:
Options File Number to be read by remote worker is part of the `CompactionServiceInput`. We've been setting this in `ProcessKeyValueCompactionWithCompactionService()` while the db_mutex is not held. This needs to be accessed while the mutex is held. The value can change as part of `SetOptions() -> RenameTempFileToOptionsFile()` as in following.
e6972196bc/db/db_impl/db_impl.cc (L5595-L5596)
Keep this value in memory during `CompactionJob::Prepare()`, which is called while the mutex is held, so that we can easily access it later without the mutex when building the CompactionServiceInput for the remote compaction.
Thanks to the crash test, this was surfaced after https://github.com/facebook/rocksdb/issues/13378 was merged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13394
Test Plan:
Unit Test
```
./compaction_service_test
```
Crash Test
```
COERCE_CONTEXT_SWITCH=1 COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j100 dbg
```
```
python3 -u tools/db_crashtest.py blackbox --enable_remote_compaction=1
```
Reviewed By: jowlyzhang
Differential Revision: D69496313
Pulled By: jaykorean
fbshipit-source-id: 7e38e3cb75d5a7708beb4883e1a138e2b09ff837
Summary:
as a preparation to support merge in [WBWIMemtable](d48af21386/memtable/wbwi_memtable.h (L31)), this PR updates how we [order updates to the same key](d48af21386/utilities/write_batch_with_index/write_batch_with_index_internal.cc (L694-L697)) in WriteBatchWithIndex. Specifically, the order is now reversed such that more recent update is ordered first. This will make iterating from WriteBatchWithIndex much easier since the key ordering in WBWI now matches internal key order where keys with larger sequence number are ordered first. The ordering is now explicitly documented above the declaration for `WriteBatchWithIndex` class.
Places that use `WBWIIteratorImpl` and assume key ordering are updated. The rest is test and comments update.
This will affect users who use WBWIIterator directly; the output of GetFromBatch, GetFromBatchAndDB or NewIteratorWithBase is not affected. Users are only affected if they may issue multiple updates to the same key. If WriteBatchWithIndex is created with `overwrite_key=true`, one of the updates needs to be Merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13387
Test Plan: we have some good coverage of WBWI, I updated some existing tests and added a test for `WBWIIteratorImpl`.
Reviewed By: pdillinger
Differential Revision: D69421268
Pulled By: cbi42
fbshipit-source-id: d97eec4ee74aeac3937c9758041c7713f07f9676
Summary:
Motivated by code review issue in https://github.com/facebook/rocksdb/issues/13316, we don't want to release the DB mutex in SetOptions between updating the cfd latest options and installing the new Version and SuperVersion. SetOptions uses LogAndApply to install a new Version but this currently incurs an unnecessary manifest write. (This is not a big performance concern because SetOptions dumps a new OPTIONS file, which is much larger than the redundant manifest update.) Since we don't want IO while holding the DB mutex, we need to get rid of the manifest write, and that's what this change does. We introduce a kind of dummy VersionEdit that allows the existing code paths of LogAndApply to install a new Version (with the updated mutable options), recompute resulting compaction scores etc., but without the manifest write.
Part of the validation for this is new assertions in SetOptions verifying the consistency of the various copies of MutableCFOptions. (I'm not convinced we need it in SuperVersion in addition to Version, but that's not for here and now.) These checks depend on defaulted `operator==` so depend on C++20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13384
Test Plan:
New unit test in addition to new assertions. SetOptions already tested heavily in crash test. Used
`ROCKSDB_CXX_STANDARD=c++20 make -j100 check` to ensure the new assertions are verified
Reviewed By: cbi42
Differential Revision: D69408829
Pulled By: pdillinger
fbshipit-source-id: 4cf026010c6bb381e0ea27567cce2708d4678e7d