Summary:
## Background
Compaction statistics are collected at various levels across different classes and structs.
* `InternalStats::CompactionStats`: Per-level Compaction Stats within a job (can be at the subcompaction level and later aggregated to the compaction level)
* `InternalStats::CompactionStatsFull`: Contains two per-level compaction stats - `output_level_stats` for primary output level stats and `proximal_level_stats` for proximal level stats. Proximal level statistics are only relevant when using Tiered Storage with the per-key placement feature enabled.
* `InternalStats::CompactionOutputsStats`: Simplified version of `InternalStats::CompactionStats`. Only has a subset of fields from `InternalStats::CompactionStats`
* `CompactionJobStats`: Job-level Compaction Stats (can be at the subcompaction level and later aggregated to the compaction level)
Please note that some fields in Job-level stats are not in Per-level stats and they don't map 1-to-1 today.
## Issues
* In non-remote compactions, proximal level compaction statistics were not being aggregated into job-level statistics. Job-level statistics were missing the proximal level stats for tiered storage compactions with the per-key-placement feature enabled.
* During remote compactions, proximal level compaction statistics were pre-aggregated into job-level statistics on the remote side. However, per-level compaction statistics were not part of the serialized compaction result, so the primary host lost that information and wasn't able to populate `per_key_placement_comp_stats_` and `internal_stats_.proximal_level_stats` properly during installation.
* `TieredCompactionTest` was only checking whether (expected stats > 0 && actual stats > 0) instead of comparing actual values
## Fixes
* Renamed `compaction_stats_` to `internal_stats_` for `InternalStats::CompactionStatsFull` in `CompactionJob` for better readability
* Removed the usage of `InternalStats::CompactionOutputsStats` and consolidated them to `InternalStats::CompactionStats`.
* Remote Compactions now include the internal stats in the serialized `CompactionServiceResult`. `output_level_stats` and `proximal_level_stats` are later propagated into the sub_compact output stats accordingly.
* `CompactionJob::UpdateCompactionJobStats()` now takes `CompactionStatsFull` and aggregates the `proximal_level_stats` as well (see the simplified sketch after this list)
* `TieredCompactionTest` now does actual value comparisons for input/output file counts and record counts. A follow-up is needed to do the same for the bytes read / written.
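For illustration, here is a minimal, self-contained sketch of the aggregation fix. The struct and function names are simplified stand-ins, not the actual `InternalStats` definitions, and only a subset of fields is shown:
```
#include <cstdint>

// Simplified stand-in for a per-level stats bucket.
struct PerLevelStats {
  uint64_t num_output_records = 0;
  uint64_t num_output_files = 0;
};

// Simplified stand-in for the job-level stats.
struct JobLevelStats {
  uint64_t num_output_records = 0;
  uint64_t num_output_files = 0;
};

// Simplified stand-in for CompactionStatsFull: one bucket for the primary
// output level and one for the proximal level (per-key placement only).
struct StatsFull {
  PerLevelStats output_level_stats;
  PerLevelStats proximal_level_stats;
};

// Before the fix, only output_level_stats fed the job-level stats; the
// proximal-level output was silently dropped for tiered-storage compactions.
void UpdateJobStats(const StatsFull& internal_stats, JobLevelStats* job) {
  job->num_output_records +=
      internal_stats.output_level_stats.num_output_records +
      internal_stats.proximal_level_stats.num_output_records;
  job->num_output_files +=
      internal_stats.output_level_stats.num_output_files +
      internal_stats.proximal_level_stats.num_output_files;
}
```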
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13464
Test Plan:
Unit Tests updated to verify stats
```
./compaction_service_test
```
```
./tiered_compaction_test
```
Reviewed By: pdillinger
Differential Revision: D71220393
Pulled By: jaykorean
fbshipit-source-id: ad70bffd9614ced683f90c7570a17def9b5c8f3f
Summary:
This PR adds a check for an invariant of the sequence number during recovery: it should not be set backward. This is inspired by a recent SEV that was caused by a software bug. It is a relatively cheap and straightforward check that RocksDB can do to avoid silently opening the DB in a corrupted state.
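A minimal sketch of such an invariant check, with assumed variable names (the real check lives inside the recovery code path):
```
#include <cstdint>

#include "rocksdb/status.h"

// Returns Corruption if the sequence number would move backward while
// recovering, instead of silently opening the DB in a corrupted state.
rocksdb::Status CheckSeqnoNotMovingBackward(uint64_t last_recovered_seqno,
                                            uint64_t new_seqno) {
  if (new_seqno < last_recovered_seqno) {
    return rocksdb::Status::Corruption(
        "Sequence number set backward during recovery");
  }
  return rocksdb::Status::OK();
}
```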
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13465
Test Plan:
Existing tests should cover the case when the invariant is met
The corrupted state is manually tested using aforementioned bug.
Reviewed By: hx235
Differential Revision: D71226513
Pulled By: jowlyzhang
fbshipit-source-id: cd8056fa6653d44ceeb9ba9b4693ab0660a53b4e
Summary:
**Context/Summary:**
Users who are interested in knowing how efficient their compactions are at reducing L0 files, or how badly their long-running compactions are "locking" L0 files, now have a reference point, "L0 files in the CF pre compaction", for their compaction input files.
- Compared to the existing stats or exposing it in some other way, exposing this info in CompactionJobInfo allows users to compare it with other compaction data (e.g., compaction input file count, compaction reason) within **one** compaction (i.e., at per-compaction granularity).
- If this number is high while a "short-running" compaction has few L0 input files, then that compaction may have room for improvement; similarly for long-running compactions. This PR adds a new field `CompactionJobInfo::num_l0_files_pre_compaction` for that.
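A hedged sketch of consuming the new field from an `EventListener`; aside from `num_l0_files_pre_compaction`, the `CompactionJobInfo` members used here (`input_files`) come from the existing public listener API:
```
#include <iostream>

#include "rocksdb/listener.h"

// Compares L0 pressure in the CF before the compaction started with the
// number of input files this particular compaction actually consumed.
class L0PressureListener : public rocksdb::EventListener {
 public:
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::cout << "CF L0 files pre compaction: "
              << info.num_l0_files_pre_compaction
              << ", compaction input files: " << info.input_files.size()
              << std::endl;
  }
};
```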
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13462
Test Plan: - Piggyback on an existing test
Reviewed By: jaykorean
Differential Revision: D71124938
Pulled By: hx235
fbshipit-source-id: aa47c9c86c62d9425771b320f5636e50671fd289
Summary:
The original implementation of NVMe write lifetime hints (https://github.com/facebook/rocksdb/pull/3095) assumed a flexible interface which decouples file creation from the explicit act of setting the write lifetime hint (see `PosixWritableFile` for more context). However, there are existing file system implementations (ex. Warm Storage) that require all the options (including file write lifetime hints) to be specified once at the time of the actual `FSWritableFile` object instantiation. We're extending the `FileOptions` with `Env::WriteLifeTimeHint` and patching existing callsites accordingly to enable one-shot metadata setup for those more constrained implementations.
NOTE: Today `CalculateSSTWriteHint` only sets write lifetime hint for Level compactions. We'll fill that gap in following PRs and add calculation for Universal Compactions which would unblock Zippy's use case.
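A hedged sketch of what the one-shot setup described above looks like from the `FileSystem` API side; the exact name of the new `FileOptions` member is an assumption here (shown as `write_hint`):
```
#include <memory>
#include <string>

#include "rocksdb/file_system.h"

// The lifetime hint travels inside FileOptions, so constrained file systems
// see it at FSWritableFile creation time rather than via a later setter.
rocksdb::IOStatus OpenWithHint(
    const std::shared_ptr<rocksdb::FileSystem>& fs, const std::string& fname,
    std::unique_ptr<rocksdb::FSWritableFile>* file) {
  rocksdb::FileOptions file_opts;
  file_opts.write_hint = rocksdb::Env::WLTH_LONG;  // assumed member name
  rocksdb::IODebugContext dbg;
  return fs->NewWritableFile(fname, file_opts, file, &dbg);
}
```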
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13461
Reviewed By: anand1976
Differential Revision: D71144645
Pulled By: mszeszko-meta
fbshipit-source-id: 6c09b62a360d48bd6e4fb08a1265bce2a49f3f4a
Summary:
In hopes of eventually removing some ugly and awkward code for compress_format_version < 2, users can no longer write files in that format, and its read support is marked deprecated. For continuing to test that read support, there is a back door to writing such files in unit tests.
If format_version < 2 is specified, it is quietly sanitized to 2. (This is similar to other BlockBasedTableOptions.)
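An illustrative sketch of the sanitization behavior from the user's perspective (not the internal sanitizer itself):
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  // Requesting a format_version below 2 is now quietly sanitized to 2.
  bbto.format_version = 1;
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}
```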
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13463
Test Plan: unit tests updated.
Reviewed By: hx235
Differential Revision: D71152916
Pulled By: pdillinger
fbshipit-source-id: 95be55e86f93f09fd898223578b9381385c3ccd8
Summary:
With generalized age-based tiering (work-in-progress), the "warm tier" data will no longer necessarily be placed in the second-to-last level (also known as the "penultimate level").
Also, the cold tier may no longer necessarily be at the last level, so we need to rename options like `preclude_last_level_seconds` to `preclude_cold_tier_seconds`, but renaming options is trickier because it can be a breaking change for consuming applications. We will do this later as a follow up.
**Minor fix included**: Fixed one `use-after-move` in CompactionPicker
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13460
Test Plan: CI
Reviewed By: pdillinger
Differential Revision: D71059486
Pulled By: jaykorean
fbshipit-source-id: fd360cdf719e015bf9f9e3f6f1663438226566a4
Summary:
This PR adds support for PerKeyPlacement in Remote Compaction.
The `seqno_to_time_mapping` is already available from the table properties of the input files. `preserve_internal_time_seconds` and `preclude_last_level_data_seconds` are directly read from the OPTIONS file upon db open in the remote worker. The necessary changes include:
- Add `is_penultimate_level_output` and `file_temperature` to the `CompactionServiceOutputFile`
- When building the output for the remote compaction, get the outputs for the penultimate level and the last level separately, and serialize them with the two additional fields added in this PR (see the simplified sketch after this list).
- When deserializing the result on the primary, SubcompactionState's `GetOutputs()` now takes `is_penultimate_level`. This allows us to determine in which level to place the output file.
- Include stats from `compaction_stats.penultimate_level_stats` in the remote compaction result
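A simplified sketch of the serialized per-output-file record, highlighting only the two fields added by this PR (the remaining fields and the real `Temperature` type are elided):
```
#include <string>

struct CompactionServiceOutputFileSketch {
  std::string file_name;
  // ... existing fields elided ...
  bool is_penultimate_level_output = false;  // added: which output level group
  int file_temperature = 0;                  // added: stands in for Temperature
};
```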
# To Follow up
- Stats to be fixed. Stats are not being populated correctly for PerKeyPlacement even for non-remote compactions.
- Clean up / Reconcile the "penultimate" naming by replacing with "proximal"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13459
Test Plan:
Updated the unit test
```
./compaction_service_test
```
Reviewed By: pdillinger
Differential Revision: D71007211
Pulled By: jaykorean
fbshipit-source-id: f926e56df17239875d849d46b8b940f8cd5f1825
Summary:
[Experiment]
This PR is a follow-up to https://github.com/facebook/rocksdb/pull/13408. The thick bandaid of ignoring all injected read errors in the context of periodic iterator auto refreshes in db stress proved to be effective. We confirmed our theory that the errors are not really a consequence of / defect in this new feature but rather due to subtle ways in which downstream code paths handle their respective IO failures. In this change we're replacing the thick 'ignore all IO read errors' bandaid in `no_batched_ops_stress` with much smaller, targeted patches in the obsolete files purge / delete codepaths, the table block cache reader, and the table cache lookup to make sure we don't miss signal and to ensure there's a single mechanism for ignoring error injection in db stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13447
Reviewed By: hx235
Differential Revision: D70794787
Pulled By: mszeszko-meta
fbshipit-source-id: c5fcd4780d82357c407f53bf0bb22fc38f7bd277
Summary:
Add debug logging when the Wait() does not return `kSuccess` so that we can compare the version state that was printed by the logging added in https://github.com/facebook/rocksdb/issues/13427 upon InputFileCheck failure.
# Test Plan
CI + Tested with Temporary Change in Meta Internal Infra
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13452
Reviewed By: hx235
Differential Revision: D70898963
Pulled By: jaykorean
fbshipit-source-id: d591b82f2df173b5e01f6552230844ce95155256
Summary:
`nullptr` is preferable to `0` or `NULL`. Let's use it everywhere so we can enable `-Wzero-as-null-pointer-constant`.
Reviewed By: dtolnay
Differential Revision: D70818166
fbshipit-source-id: 4658fb004676fe2686249fdd8ecb322dec8aa63d
Summary:
Primarily, fix an issue from https://github.com/facebook/rocksdb/issues/13316 with opening secondary DB with preserve/preclude option (crash test disable in https://github.com/facebook/rocksdb/issues/13439). The issue comes down to mixed-up interpretations of "read_only" which should now be resolved. I've introduced the stronger notion of "unchanging" which means the VersionSet never sees any changes to the LSM tree, and the weaker notion of "read_only" which means LSM tree changes are not written through this VersionSet/etc. but can pick up externally written changes. In particular, ManifestTailer should use read_only=true (along with unchanging=false) for proper handling of preserve/preclude options.
A new assertion in VersionSet::CreateColumnFamily to help ensure sane usage of the two boolean flags is incompatible with the known wart of allowing CreateColumnFamily on a read-only DB. So to keep that assertion, I have fixed that issue by disallowing it. And this in turn required downstream clean-up in ldb, where I cleaned up some call sites as well.
Also, rename SanitizeOptions for ColumnFamilyOptions to SanitizeCfOptions, for ease of search etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13441
Test Plan:
* Added preserve option to a test in db_secondary_test, which reproduced the failure seen in the crash test.
* Revert https://github.com/facebook/rocksdb/issues/13439 to re-enable crash test functionality
* Update some tests to deal with disallowing CF creation on read-only DB
* Add some testing around read-only DBs and CreateColumnFamily(ies)
* Resurrect a nearby test for read-only DB to be sure it doesn't write to the DB dir. New EnforcedReadOnlyReopen should probably be used in more places but didn't want to attempt a big migration here and now. (Suggested follow-up.)
Reviewed By: jowlyzhang
Differential Revision: D70808033
Pulled By: pdillinger
fbshipit-source-id: 486b4e9f9c9045150a0ebb9cb302753d03932a3f
Summary:
In case the primary host has a new option added which isn't available in the remote worker yet, the remote compaction currently fails. In most cases, these new options are not relevant to the remote compaction and the worker should be able to move on by ignoring them.
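This is the same tolerance the public options-loading API offers via `ConfigOptions::ignore_unknown_options`; the sketch below only illustrates that concept and is not the remote-compaction code path itself:
```
#include <string>
#include <vector>

#include "rocksdb/utilities/options_util.h"

// Load an OPTIONS file while skipping option names this build doesn't know,
// rather than failing the whole load.
rocksdb::Status LoadOptionsTolerantly(
    const std::string& options_file, rocksdb::Env* env,
    rocksdb::DBOptions* db_opts,
    std::vector<rocksdb::ColumnFamilyDescriptor>* cf_descs) {
  rocksdb::ConfigOptions config;
  config.env = env;
  config.ignore_unknown_options = true;
  return rocksdb::LoadOptionsFromFile(config, options_file, db_opts, cf_descs);
}
```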
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13443
Test Plan: Verified internally in Meta Infra.
Reviewed By: anand1976
Differential Revision: D70744359
Pulled By: jaykorean
fbshipit-source-id: eb6a388c2358a7f8089f2e35a378b7017b9e03f3
Summary:
If a compaction job needs to be aborted inside `Schedule()` or `Wait()` today (e.g. the primary host is shutting down), the only two options are the following:
- Handle it as failure by returning `CompactionServiceJobStatus::kFailure`
- Return `CompactionServiceJobStatus::kUseLocal` and let the compaction move on locally and eventually succeed or fail depending on the timing
In this PR, we are introducing a new status, `CompactionServiceJobStatus::kAborted`, so that the implementation of `Schedule()` and `Wait()` can return it. Just like how `CompactionServiceJobStatus::kFailure` is handled, the compaction will not move on and will fail, but the status will be returned as `Status::Aborted()` instead of `Status::Incomplete()`.
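A minimal sketch of the decision an implementation might make, with simplified/assumed inputs (the real `CompactionService::Wait()` takes job information and a result string):
```
#include "rocksdb/options.h"

// New in this PR: kAborted surfaces on the primary as Status::Aborted()
// instead of Status::Incomplete(), and the compaction does not fall back
// to running locally.
rocksdb::CompactionServiceJobStatus WaitSketch(bool primary_shutting_down,
                                               bool remote_job_failed) {
  if (primary_shutting_down) {
    return rocksdb::CompactionServiceJobStatus::kAborted;
  }
  if (remote_job_failed) {
    return rocksdb::CompactionServiceJobStatus::kFailure;
  }
  return rocksdb::CompactionServiceJobStatus::kSuccess;
}
```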
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13438
Test Plan:
Unit Test added
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.AbortedWhileWait*"
```
Reviewed By: anand1976, hx235
Differential Revision: D70655355
Pulled By: jaykorean
fbshipit-source-id: 22614ce9c7455cda649b15465625edc93978fe11
Summary:
I have a place I want to use this helper method inside the Sally codebase. I have this functionality in my Sally diff right now, but I think it is generic enough to warrant putting alongside `Env::PriorityToString`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13440
Test Plan: Just the compiler and CI checks are sufficient IMO.
Reviewed By: hx235
Differential Revision: D70664597
Pulled By: archang19
fbshipit-source-id: 341de6c6e311a3f421ad093c2c216e5caa5034dd
Summary:
This PR adds the ability to use an ExternalTableBuilder through the SstFileWriter to create external tables. This is a counterpart to https://github.com/facebook/rocksdb/issues/13401 , which adds the ExternalTableReader. The support for external tables is confined to ingestion only DBs, with external table files ingested into the bottommost level only. https://github.com/facebook/rocksdb/issues/13431 enforces ingestion only DBs by adding a disallow_memtable_writes column family option.
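A hedged illustration using the existing public `SstFileWriter` flow; per this PR, an `ExternalTableBuilder` would be picked up through the table factory configured in `Options` (the exact plumbing is internal to the change):
```
#include <string>

#include "rocksdb/sst_file_writer.h"

// Write a standalone file for later ingestion. With an external table
// factory configured in `options`, the file would be produced by the
// plugged-in ExternalTableBuilder instead of the block-based builder.
rocksdb::Status WriteExternalFile(const rocksdb::Options& options,
                                  const std::string& file_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(file_path);
  if (!s.ok()) {
    return s;
  }
  // External table files are expected to contain only Puts.
  s = writer.Put("key1", "value1");
  if (!s.ok()) {
    return s;
  }
  return writer.Finish();
}
```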
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13434
Test Plan: New unit tests in table_test.cc
Reviewed By: pdillinger
Differential Revision: D70532054
Pulled By: anand1976
fbshipit-source-id: a837487eadfabed9627a0eceb403bfc5fc2c427c
Summary:
Add an unordered_map of name/value pairs in ReadOptions::property_bag, similar to IOOptions::property_bag. It allows users to pass through some custom options to an external table.
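A small sketch of the intended usage; this assumes `ReadOptions::property_bag` is a plain `std::unordered_map<std::string, std::string>`, mirroring `IOOptions::property_bag` as stated above, and the key name is purely hypothetical:
```
#include "rocksdb/options.h"

rocksdb::ReadOptions MakeReadOptionsWithReaderHints() {
  rocksdb::ReadOptions ro;
  // Passed through to the external table reader; ignored by built-in tables.
  ro.property_bag["my_external_reader.block_size"] = "65536";  // hypothetical key
  return ro;
}
```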
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13436
Reviewed By: jaykorean
Differential Revision: D70649609
Pulled By: anand1976
fbshipit-source-id: 9b14806a9f3599b861827bd4ae6e948861edc51a
Summary:
PR https://github.com/facebook/rocksdb/issues/13316 broke some crash test cases in DBImplSecondary, from combining test_secondary=1 and preserve_internal_time_seconds>0. Disabling that while investigating the fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13439
Test Plan: manual blackbox_crash_test runs with forced test_secondary=1
Reviewed By: anand1976
Differential Revision: D70656373
Pulled By: pdillinger
fbshipit-source-id: fa2139e90bbe64ec8ebb062877d9337894ea3b43
Summary:
... to better support "ingestion only" column families such as those using an external file reader as in https://github.com/facebook/rocksdb/issues/13401.
It would be possible to implement this by getting rid of the memtable for that CF, but it quickly became clear that such an approach would need to update a lot of places to deal with such a possibility. And we already have logic to optimize reads when a memtable is empty. We put a vector memtable in place to minimize the overhead of an empty memtable.
There are three layers of defense against writes to the memtable (a brief usage sketch follows this list):
* WriteBatch ops to a disallowed CF will fail immediately, without waiting for Write(). For this check to work, we need a ColumnFamilyHandle and because of that, we don't support disallow_memtable_writes on the default column family.
* MemtableInserter will reject writes to disallowed CFs. This is needed to protect re-open with disallow when there are existing writes in a WAL.
* The placeholder memtable is marked immutable. This will cause an assertion failure on attempt to write, such as in case of bug or regression.
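A hedged sketch of the user-visible behavior when the option is set; only `disallow_memtable_writes` comes from this PR, the rest is standard API:
```
#include <cassert>

#include "rocksdb/db.h"

// ingest_only_cf was created with ColumnFamilyOptions where
// disallow_memtable_writes = true (not the default CF; see above).
void ExpectWriteRejected(rocksdb::DB* db,
                         rocksdb::ColumnFamilyHandle* ingest_only_cf) {
  rocksdb::Status s =
      db->Put(rocksdb::WriteOptions(), ingest_only_cf, "k", "v");
  // Rejected up front by the first layer of defense, before Write() runs.
  assert(!s.ok());
}
```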
Suggested follow-up:
* Remove the limitation on using the option with the default column family, perhaps by solving https://github.com/facebook/rocksdb/issues/13429 more generally or perhaps with some specific check before the first memtable write of the batch (but potential CPU overhead for such a check - there's likely optimization opportunities around ColumnFamilyMemTables).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13431
Test Plan:
unit tests added
Performance: A db_bench call designed to realistically focus on the CPU cost of writes:
```
./db_bench -db=/dev/shm/dbbench1 --benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -num_column_families=20 -disable_wal -write_buffer_size=1234000
```
Running before & after tests at the same time on the same machine, 40 iterations each, average ops/s, DEBUG_LEVEL=0, remove slowest run of each:
Before: 772466
After: 773785 (0.2% faster)
Likely within the noise, as if there was any change, we would expect a slight regression.
Reviewed By: anand1976
Differential Revision: D70495936
Pulled By: pdillinger
fbshipit-source-id: 306f7e737f87c1fbb52c5805f3cadb6e8ced9b40
Summary:
This is an unexpectedly complex follow-up to https://github.com/facebook/rocksdb/issues/13269.
This change solves (and adds detection for regressions of) inconsistencies between whether a CF's SuperVersion is configured with a preserve/preclude option and whether it gets a usable SeqnoToTimeMapping. Operating with preserve/preclude and no usable mapping is degraded functionality we need to avoid. Meanwhile, having no mapping is useful for actually disabling the feature (except with respect to existing SST files, but that's less of a concern for now).
The challenge is that how we maintain the DB's SeqnoToTimeMapping can depend on all the column families, and we don't want to iterate over all column families *for each column family* (e.g. on initially creating each). The existing code was a bit relaxed:
* On initially creating or re-configuring a CF, we might install an empty mapping, but soon thereafter (after releasing and re-acquiring the DB mutex) re-install another SuperVersion with a useful mapping.
The solution here is to refactor the logic so that there's a distinct but related workflow for (a) ensuring a quality set of mappings when we might only be considering a single CF (`EnsureSeqnoToTimeMapping()`), and (b) massaging that set of mappings to account for all CFs (`RegisterRecordSeqnoTimeWorker`) which doesn't need to re-install new SuperVersions because each CF already has good mappings and will get updated SuperVersions when the periodic task adds new mappings. This should eliminate the extra SuperVersion installs associated with preserve/preclude on CF creation or re-configure, making it the same as any other CF.
Some more details:
* Some refactorings such as removing new_seqno_to_time_mapping from SuperVersionContext. (Now use parameter instead of being stateful.)
* Propagate `read_only` aspect of DB to more places so that we can pro-actively disable preserve/preclude on read-only DBs, so that we don't run afoul of the assertion expecting SeqnoToTime entries.
* Introduce a utility struct `MinAndMaxPreserveSeconds` for aggregating preserve/preclude settings in a useful way, sometimes on one CF and sometimes across multiple CFs. Much cleaner! (IMHO)
* Introduce a function `InstallSuperVersionForConfigChange` that is a superset of `InstallSuperVersionAndScheduleWork` for when a CF is new or might have had a change to its mutable options.
* Eliminate redundant re-install SuperVersions of created "missing" CFs in DBImpl::Open.
Intended follow-up:
* Ensure each flush has an "upper bound" SeqnoToTime entry, which would resolve a FIXME in tiered_compaction_test, but causes enough test churn to deserve its own PR + investigation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13316
Test Plan:
This change is primarily validated by a new assertion in SuperVersion::Init to ensure consistency between (a) presence of any SeqnoToTime mappings in the SuperVersion and (b) preserve/preclude option being currently set.
One unit test update was needed because we now ensure at least one SeqnoToTime entry is created on any DB::Open with preserve/preclude, so that there is a lower bound time on all the future data writes. This required a small hack in associating the time with Seqno 1 instead of 0, which is reserved for "unspecified old."
Reviewed By: cbi42
Differential Revision: D70540638
Pulled By: pdillinger
fbshipit-source-id: bb419fdbeb5a1f115fc429c211f9b8efaf2f56d7
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13435
We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.
The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.
This fix enables the flag internally and reverts the previous fix, which landed with D70423483.
Reviewed By: pdillinger
Differential Revision: D70584250
fbshipit-source-id: 28e41316187c474fdfaf854f301ad14b6721fcad
Summary:
When reading with ReadOptions::read_tier = kPersistedTier and with a snapshot, MultiGet allows the case where some CF is read before a flush and some CF is read after the flush. This is not desirable, especially when atomic_flush is enabled and users use MultiGet to do consistency checks on the data in SST files. This PR updates the code path for SuperVersion acquisition to get a consistent view across column families when kPersistedTier is used.
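For reference, a sketch of the read pattern in question using the public API; after this fix, the reads below observe a consistent (pre- or post-flush) view across the CFs:
```
#include <string>
#include <vector>

#include "rocksdb/db.h"

std::vector<rocksdb::Status> PersistedTierMultiGet(
    rocksdb::DB* db, const std::vector<rocksdb::ColumnFamilyHandle*>& cfs,
    const std::vector<rocksdb::Slice>& keys) {
  rocksdb::ReadOptions ro;
  ro.read_tier = rocksdb::kPersistedTier;  // read only persisted (SST) data
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  ro.snapshot = snap;
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses = db->MultiGet(ro, cfs, keys, &values);
  db->ReleaseSnapshot(snap);
  return statuses;
}
```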
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13433
Test Plan: a new unit test that could be flaky without this change.
Reviewed By: jaykorean
Differential Revision: D70509688
Pulled By: cbi42
fbshipit-source-id: 80de96f94407af9bb2062b6a185c61f65827c092
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13432
We've noticed the default CRC32c function gets executed when running on aarch64 CPUs within our servers.
The issue is that ROCKSDB_AUXV_GETAUXVAL_PRESENT evaluates to false.
This fix allows the usage of hardware-accelerated CRC32 within our fleet.
Reviewed By: jaykorean
Differential Revision: D70423483
fbshipit-source-id: 601da3fbf156e3e40695eb76ee5d37f67f83d427
Summary:
This adds a test that attempts DeleteRange() with PlainTable (not supported) and shows that it not only puts the DB in failed write mode, but also (a) breaks WriteBatch atomicity for readers, because they can see just part of a failed WriteBatch, and (b) makes the DB not recoverable (without manual intervention) if using the WAL.
Note: WriteBatch atomicity is not clearly documented but indicated at the top of write_batch.h and the wiki page for Transactions, even without Transactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13428
Test Plan: this is the test
Reviewed By: anand1976
Differential Revision: D70332226
Pulled By: pdillinger
fbshipit-source-id: 67bc4de68833a80578e48baa9d3a4f23f1600f3c
Summary:
The existing format compatibility test had limited coverage of compression options, particularly newer algorithms with and without dictionary compression. There are some subtleties that need to remain consistent, such as index blocks potentially being compressed but *not* using the file's dictionary if they are. This involves detecting (with a rough approximation) builds with the appropriate capabilities.
The other motivation for this change is testing some potentially useful reader-side functionality that has been in place for a long time but has not been exercised until now: mixing compressions in a single SST file. The block-based SST schema puts a compression marker on each block; arguably this is for distinguishing blocks compressed using the algorithm stored in compression_name table property from blocks left uncompressed, e.g. because they did not reach the threshold of useful compression ratio, but the marker can also distinguish compression algorithms / decompressors.
As we work toward customizable compression, it seems worth unlocking the capability to leverage the existing schema and SST reader-side support for mixing compression algorithms among the blocks of a file. Yes, a custom compression could implement its own dynamic algorithm chooser with its own tag on the compressed data (e.g. first byte), but that is slightly less storage efficient and doesn't support "vanilla" RocksDB builds reading files using a mix of built-in algorithms. As a hypothetical example, we might want to switch to lz4 on a machine that is under heavy CPU load and back to zstd when load is more normal. I dug up some data indicating ~30 seconds per output file in compaction, suggesting that file-level responsiveness might be too slow. This agility is perhaps more useful with disaggregated storage, where there is more flexibility in DB storage footprint and potentially more payoff in optimizing the *average* footprint.
In support of this direction, I have added a backdoor capability for debug builds of `ldb` to generate files with a mix of compression algorithms and incorporated this into the format compatibility test. All of the existing "forward compatible" versions (currently back to 8.6) are able to read the files generated with "mixed" compression. (NOTE: there's no easy way to patch a bunch of old versions to have them support generating mixed compression files, but going forward we can auto-detect builds with this "mixed" capability.) A subtle aspect of this support is that for proper handling of decompression contexts and digested dictionaries, we need to set the `compression_name` table property to `zstd` if any blocks are zstd compressed. I'm expecting to add better info to SST files in follow-up, but this approach here gives us forward compatibility back to 8.6.
However, in the spirit of opening things up with what makes sense under the existing schema, we only support one compression dictionary per file. It will be used by any/all algorithms that support dictionary compression. This is not outrageous because it seems standard that a dictionary is *or can be* arbitrary data representative of what will be compressed. This means we would need a schema change to add dictionary compression support to an existing built-in compression algorithm (because otherwise old versions and new versions would disagree on whether the data dictionary is needed with that algorithm; this could take the form of a new built-in compression type, e.g. `kSnappyCompressionWithDict`; only snappy, bzip2, and windows-only xpress compression lack dictionary support currently).
Looking ahead to supporting custom compression, exposing a sizeable set of CompressionTypes to the user for custom handling essentially guarantees a path for the user to put *versioning* on their compression even if they neglect that initially, and without resorting to managing a bunch of distinct named entities. (I'm envisioning perhaps 64 or 127 CompressionTypes open to customization, enough for ~weekly new releases with more than a year of horizon on recycling.)
More details:
* Reduce the running time (CI cost) of the default format compatibility test by randomly sampling versions that aren't the oldest in a category. AFAIK, pretty much all regressions can be caught with the even more stripped-down SHORT_TEST.
* Configurable make parallelism with J environment variable
* Generate data files in a way that makes them much more eligible for index compression, e.g. bigger keys with less entropy
* Generate enough data files
* Remove 2.7.fb.branch from list because it shows an assertion violation when involving compression.
* Randomly choose a contiguous subset of the compression algorithms X {dictionary, no dictionary} configuration space when generating files, with a number of files > number of algorithms. This covers all the algorithms and both dictionary/no dictionary for each release (but not in all combinations).
* Have `ldb` fail if the specified compression type is not supported by the build.
Other future work needed:
* Blob files in format compatibility test, and support for mixed compression. NOTE: the blob file schema should naturally support mixing compression algorithms but the reader code does not because of an assertion that the block CompressionType (if not no compression) matches the whole file CompressionType. We might introduce a "various" CompressionType for this whole file marker in blob files.
* Do more to ensure certain features and code paths e.g. in the scripts are actually used in the compatibility test, so that they aren't accidentally neutralized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13414
Test Plan: Manual runs with some temporary instrumentation, also a recent revision of this change included a GitHub Actions run of the updated format compatible test: https://github.com/facebook/rocksdb/actions/runs/13463551149/job/37624205915?pr=13414
Reviewed By: hx235
Differential Revision: D70012056
Pulled By: pdillinger
fbshipit-source-id: 9ea5db76ba01a95338ed1a86b0edd71a469c4061
Summary:
Added merge support for WBWIMemTable. Most of the preparation work was done in https://github.com/facebook/rocksdb/issues/13387 and https://github.com/facebook/rocksdb/issues/13400. The main code change to support merge is in wbwi_memtable.cc to support reading the Merge value type. The rest of the changes are mostly comment changes and tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13410
Test Plan:
- new unit test
- ran `python3 ./tools/db_crashtest.py --txn blackbox --txn_write_policy=0 --commit_bypass_memtable_one_in=100 --test_batches_snapshots=0 --use_merge=1` for several runs.
Reviewed By: jowlyzhang
Differential Revision: D69885868
Pulled By: cbi42
fbshipit-source-id: b127d95a3027dc35910f6e5d65f3409ba27e2b6b
Summary:
... to ensure proper cache charging. However, this is a somewhat hazardous combination if there are many CFs and could be the target of future work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13398
Test Plan: this is the test
Reviewed By: hx235
Differential Revision: D69619977
Pulled By: pdillinger
fbshipit-source-id: 9841768584e4688d8fdd0258f3ba9608b67408e5
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13411
We should initialize statuses with OK rather than IOError to correctly handle cases like NotFound due to the bloom filter. In case of IOError, the status would be updated appropriately by the reader.
Reviewed By: anand1976
Differential Revision: D69886976
fbshipit-source-id: 92b130168f23633224ff4153bfe46a7d86482b90
Summary:
This is a preparation for supporting merge in `WBWIMemTable`. This PR updates the sequence number assignment method so that it allows efficient and simple assignment when there are multiple entries with the same user key. This can happen when the WBWI contains Merge operations. This assignment relies on tracking the number of updates issued for each key in each WBWI entry (`WriteBatchIndexEntry::update_count`). Some refactoring is done in WBWI to remove `last_entry_offset` as part of the WBWI state, which I found harder to use correctly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13400
Test Plan: updated unit tests to check that update count is tracked correctly and WBWIMemTable is assigning sequence number as expected.
Reviewed By: pdillinger
Differential Revision: D69666462
Pulled By: cbi42
fbshipit-source-id: 9b18291825017a67c4da3318e8a556aa2971326b
Summary:
This PR introduces an interface to plug in an external table file reader into RocksDB. The external table reader may support custom file formats that might work better for a specific use case compared to RocksDB native formats. This initial version allows the external table file to be loaded and queried using an `SstFileReader`. In the near future, we will allow it to be used with a limited RocksDB instance that allows bulkload but not live writes.
The model of a DB using an external table reader is a read only database allowing bulkload and atomic replace in the bottommost level only. Live writes, if supported in the future, are expected to use block based table files in higher levels. Tombstones, merge operands, and non-zero sequence numbers are expected to be present only in non-bottommost levels. External table files are assumed to have only Puts, and all keys implicitly have sequence number 0.
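An illustration using the public `SstFileReader` API mentioned above; the external table reader itself is plugged in through the table factory configured in `Options` (interface details are in the PR):
```
#include <memory>
#include <string>

#include "rocksdb/sst_file_reader.h"

rocksdb::Status ScanFile(const rocksdb::Options& options_with_external_factory,
                         const std::string& file_path) {
  rocksdb::SstFileReader reader(options_with_external_factory);
  rocksdb::Status s = reader.Open(file_path);
  if (!s.ok()) {
    return s;
  }
  std::unique_ptr<rocksdb::Iterator> it(
      reader.NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // All keys implicitly carry sequence number 0 (Puts only).
  }
  return it->status();
}
```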
TODO (in future PRs) -
1. Add support for external file ingestion, with safety mechanisms to prevent accidental writes
2. Add support for atomic column family replace
3. Allow custom table file extensions
4. Add a TableBuilder interface for use with `SstFileWriter`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13401
Reviewed By: pdillinger
Differential Revision: D69689351
Pulled By: anand1976
fbshipit-source-id: c5d5b92d56fd4d0fc43a77c4ceb0463d4f479bda
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13403
Add MultiGet support to SstFileReader. Today we only have iteration support; this change adds MultiGet support in case an application wants to use it.
Reviewed By: anand1976
Differential Revision: D69514499
fbshipit-source-id: 20e85a4bd13a3a9f45dacb223c1a4541fb87f561
Summary:
Noticed that the `do_merge` parameter is not properly set while working on memtable code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13396
Test Plan: updated unit test for the read-only db case.
Reviewed By: jaykorean
Differential Revision: D69505015
Pulled By: cbi42
fbshipit-source-id: d4c64ca7bba31fe26aa41a29cbc55835d9f1f116
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/13263 and https://github.com/facebook/rocksdb/pull/13360 temporarily disabled `track_and_verify_wals` with some injection under TXN, but recent stress tests have found more issues surfaced by this feature even with the previous disabling. Disabling the feature **completely** now to stabilize CI while debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13405
Test Plan: Monitor CI
Reviewed By: cbi42
Differential Revision: D69759276
Pulled By: hx235
fbshipit-source-id: 501a3561acb9daa834f874095f9a66ae6ae5aa42
Summary:
**Context/Summary:**
It's [documented (affcad0cc9/db/job_context.h (L230)) that `// For non-empty JobContext Clean() has to be called at least once before destruction`. This was accidentally violated in a UT, causing the assertion failure `assert(logs_to_free.size() == 0);` in `~JobContext`. This PR is to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13406
Test Plan: Monitor for future UT assertion failure in `TEST_F(DBWALTest, FullPurgePreservesRecycledLog) `
Reviewed By: cbi42
Differential Revision: D69759725
Pulled By: hx235
fbshipit-source-id: dd1617b370a2c69daba657287dcf258542f92ef5
Summary:
**Context/Summary:**
02b4197544 recently added the ability to detect a WAL hole present in the predecessor WAL. It forgot to update the corrupted WAL number to point to the predecessor WAL in that corruption case. This PR fixes it.
As a bonus, this PR also (1) fixed the `FragmentBufferedReader()` constructor API to expose fewer parameters, as they are never explicitly passed in the codebase, (2) an INFO log wording, (3) a parameter naming typo, and (4) the reporter naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13359
Test Plan:
1. Manual printing to ensure the corrupted wal number is set to the right number
2. Existing UTs
Reviewed By: jowlyzhang
Differential Revision: D69068089
Pulled By: hx235
fbshipit-source-id: f7f8a887cded2d3a26cf9982f5d1d1ab6a78e9e1
Summary:
**Context/Summary:**
Secondary DB relies on the open file descriptor of the shared SST file in the primary DB to continue being able to read the file even if that file is deleted in the primary DB. However, this won't work if the file is truncated instead of deleted, which triggers a "truncated block read" corruption in the stress test on secondary DB reads. Truncation can happen if RocksDB's implementation of SstFileManager and `bytes_max_delete_chunk>0` are used. This PR disables such a testing combination in the stress test and clarifies the related API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13395
Test Plan:
- Manually repro-ed with the UT below. I'm in favor of not including this UT in the codebase as it should be self-evident from the API comment now about the incompatibility. Secondary DB is in the direction of being replaced by Follower, so we should minimize edge-case tests for code with no functional change for a to-be-replaced functionality.
```
TEST_F(DBSecondaryTest, IncompatibleWithPrimarySSTTruncation) {
  Options options;
  options.env = env_;
  options.disable_auto_compactions = true;
  // SstFileManager configured with a small bytes_max_delete_chunk so that
  // obsolete SST files get truncated in chunks rather than unlinked at once.
  options.sst_file_manager.reset(NewSstFileManager(
      env_, nullptr /*fs*/, "" /*trash_dir*/, 2024000 /*rate_bytes_per_sec*/,
      true /*delete_existing_trash*/, nullptr /*status*/,
      0.25 /*max_trash_db_ratio*/, 1129 /*bytes_max_delete_chunk*/));
  Reopen(options);
  ASSERT_OK(Put("key1", "old_value"));
  ASSERT_OK(Put("key2", "old_value"));
  ASSERT_OK(Flush());
  ASSERT_OK(Put("key1", "new_value"));
  ASSERT_OK(Put("key3", "new_value"));
  ASSERT_OK(Flush());
  Options options1;
  options1.env = env_;
  options1.max_open_files = -1;
  Reopen(options);
  OpenSecondary(options1);
  ASSERT_OK(db_secondary_->TryCatchUpWithPrimary());
  // While the primary truncates the now-obsolete SST file (after the
  // compaction below), a read through the secondary's open file descriptor
  // hits a "truncated block read" corruption.
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "DeleteScheduler::DeleteTrashFile:Fsync", [&](void*) {
        std::string value;
        Status s = db_secondary_->Get(ReadOptions(), "key2", &value);
        assert(s.IsCorruption());
        assert(s.ToString().find("truncated block read") != std::string::npos);
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
```
- Monitor future stress test
Reviewed By: jowlyzhang
Differential Revision: D69499694
Pulled By: hx235
fbshipit-source-id: 57525b9841897f42aecb758a4d3dd3589367dcd9
Summary:
# Problem
Once opened, an iterator will preserve its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh'ed`, the iterator will hold on to the `Init`-time assigned `SuperVersion` throughout its lifetime. As time goes by, this might result in artificially long holdup of obsolete memtables (_potentially_ referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved to be especially problematic in the case of _logical_ backups (outside of RocksDB `BackupEngine`).
# Solution
Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in https://github.com/facebook/rocksdb/pull/10594, we're adding a new `ReadOptions` opt-in knob that (when enabled) will instruct the iterator to automatically refresh itself to the latest superversion - all that while retaining the originally assigned, explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To ensure minimal performance overhead we're leveraging relaxed atomic for superversion freshness lookups.
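A hedged sketch of the opt-in from the API side; the `ReadOptions` field name used here (matching the db_bench flag mentioned below) is an assumption, and snapshot release is elided:
```
#include <memory>

#include "rocksdb/db.h"

std::unique_ptr<rocksdb::Iterator> NewAutoRefreshingIterator(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  // Explicit snapshot: reads stay consistent even as the iterator refreshes
  // itself to newer SuperVersions.
  ro.snapshot = db->GetSnapshot();
  ro.auto_refresh_iterator_with_snapshot = true;  // assumed field name
  return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro));
}
```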
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13354
Test Plan:
**Correctness:** New test to demonstrate the auto refresh behavior in contrast to legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.
**Stress testing:** We're adding command line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can with random feature on/off configuration in db_crashtest.py.
# Benchmarking
The goal of this benchmark is to validate that throughput did not regress substantially. Benchmark was run on optimized build, 3-5 times for each respective category or till convergence. In addition, we configured aggressive threshold of 1 second for new `Superversion` creation. Experiments have been run 'in parallel' (at the same time) on separate db instances within a single host to evenly spread the potential adverse impact of noisy neighbor activities. Host specs [1].
**TLDR;** Baseline & new solution are practically indistinguishable from performance standpoint. Difference (positive or negative) in throughput relative to the baseline, if any, is no more than 1-2%.
**Snapshot initialization approach:**
This feature is only effective on iterators with well-defined `snapshot` passed via `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive. Especially in case of 'tiny scans' (100 rows) contributing as much as 25-35 microseconds, which is ~20-30% of the average per/op latency unintentionally masking _potentially_ adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all the future scans _before_ running multiple experiments en masse. This is also a valuable data point for us to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower bound cost could be.
## "DB in memory" benchmark
**DB Setup**
1. Allow a single memtable to grow large enough (~572MB) to fit all the rows. Upon shutdown all the rows will be flushed to the WAL file (the inspected `000004.log` file is 541MB in size).
```
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000 -max_write_buffer_number=2 -compression_type=none
```
2. As a part of recovery in subsequent DB open, WAL will be processed to one or more SST files during the recovery. We're selecting a large block cache (`cache_size` parameter in `db_bench` script) suitable for holding the entire DB to test the “hot path” CPU overhead.
```
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=1 -async_io=1 -num=? -seek_nexts=? -cache_size=? -write_buffer_size=? -auto_refresh_iterator_with_snapshot={0|1}
```
| | seek_nexts=100; num=2,000,000 | seek_nexts = 20,000; num=50000 | seek_nexts = 400,000; num=2000 |
-- | -- | -- | --
baseline | 36362 (± 300) ops/sec, 928.8 (± 23) MB/s, 99.11% block cache hit | 52.5 (± 0.5) ops/sec, 1402.05 (± 11.85) MB/s, 99.99% block cache hit | 156.2 (± 6.3) ms / op, 1330.45 (± 54) MB/s, 99.95% block cache hit
auto refresh | 35775.5 (± 537) ops/sec, 926.65 (± 13.75) MB/s, 99.11% block cache hit | 53.5 (± 0.5) ops/sec, 1367.9 (± 9.5) MB/s, 99.99% block cache hit | 162 (± 4.14) ms / op, 1281.35 (± 32.75) MB/s, 99.95% block cache hit
_-cache_size=5000000000 -write_buffer_size=3200000000 -max_write_buffer_number=2_
| | seek_nexts=3,500,000; num=100 |
-- | --
baseline | 1447.5 (± 34.5) ms / op, 1255.1 (± 30) MB/s, 98.98% block cache hit
auto refresh | 1473.5 (± 26.5) ms / op, 1232.6 (± 22.2) MB/s, 98.98% block cache hit
_-cache_size=17680000000 -write_buffer_size=14500000000 -max_write_buffer_number=2_
| | seek_nexts=17,500,000; num=10 |
-- | --
baseline | 9.11 (± 0.185) s/op, 997 (± 20) MB/s
auto refresh | 9.22 (± 0.1) s/op, 984 (± 11.4) MB/s
[1]
### Specs
| Property | Value
-- | --
RocksDB | version 10.0.0
Date | Mon Feb 3 23:21:03 2025
CPU | 32 * Intel Xeon Processor (Skylake)
CPUCache | 16384 KB
Keys | 16 bytes each (+ 0 bytes user-defined timestamp)
Values | 100 bytes each (50 bytes after compression)
Prefix | 0 bytes
RawSize | 5.5 MB (estimated)
FileSize | 3.1 MB (estimated)
Compression | Snappy
Compression sampling rate | 0
Memtablerep | SkipListFactory
Perf Level | 1
Reviewed By: pdillinger
Differential Revision: D69122091
Pulled By: mszeszko-meta
fbshipit-source-id: 147ef7c4fe9507b6fb77f6de03415bf3bec337a8
Summary:
Options File Number to be read by remote worker is part of the `CompactionServiceInput`. We've been setting this in `ProcessKeyValueCompactionWithCompactionService()` while the db_mutex is not held. This needs to be accessed while the mutex is held. The value can change as part of `SetOptions() -> RenameTempFileToOptionsFile()` as in following.
e6972196bc/db/db_impl/db_impl.cc (L5595-L5596)
Keep this value in memory during `CompactionJob::Prepare()`, which is called while the mutex is held, so that we can easily access it later without the mutex when building the CompactionServiceInput for the remote compaction.
Thanks to the crash test, this was surfaced after https://github.com/facebook/rocksdb/issues/13378 was merged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13394
Test Plan:
Unit Test
```
./compaction_service_test
```
Crash Test
```
COERCE_CONTEXT_SWITCH=1 COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j100 dbg
```
```
python3 -u tools/db_crashtest.py blackbox --enable_remote_compaction=1
```
Reviewed By: jowlyzhang
Differential Revision: D69496313
Pulled By: jaykorean
fbshipit-source-id: 7e38e3cb75d5a7708beb4883e1a138e2b09ff837
Summary:
as a preparation to support merge in [WBWIMemtable](d48af21386/memtable/wbwi_memtable.h (L31)), this PR updates how we [order updates to the same key](d48af21386/utilities/write_batch_with_index/write_batch_with_index_internal.cc (L694-L697)) in WriteBatchWithIndex. Specifically, the order is now reversed such that more recent update is ordered first. This will make iterating from WriteBatchWithIndex much easier since the key ordering in WBWI now matches internal key order where keys with larger sequence number are ordered first. The ordering is now explicitly documented above the declaration for `WriteBatchWithIndex` class.
Places that use `WBWIIteratorImpl` and assume key ordering are updated. The rest is test and comments update.
This will affect users who use WBWIIterator directly; the output of GetFromBatch, GetFromBatchAndDB or NewIteratorWithBase is not affected. Users are only affected if they may issue multiple updates to the same key. If WriteBatchWithIndex is created with `overwrite_key=true`, one of the updates needs to be Merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13387
Test Plan: we have some good coverage of WBWI, I updated some existing tests and added a test for `WBWIIteratorImpl`.
Reviewed By: pdillinger
Differential Revision: D69421268
Pulled By: cbi42
fbshipit-source-id: d97eec4ee74aeac3937c9758041c7713f07f9676
Summary:
Motivated by code review issue in https://github.com/facebook/rocksdb/issues/13316, we don't want to release the DB mutex in SetOptions between updating the cfd latest options and installing the new Version and SuperVersion. SetOptions uses LogAndApply to install a new Version but this currently incurs an unnecessary manifest write. (This is not a big performance concern because SetOptions dumps a new OPTIONS file, which is much larger than the redundant manifest update.) Since we don't want IO while holding the DB mutex, we need to get rid of the manifest write, and that's what this change does. We introduce a kind of dummy VersionEdit that allows the existing code paths of LogAndApply to install a new Version (with the updated mutable options), recompute resulting compaction scores etc., but without the manifest write.
Part of the validation for this is new assertions in SetOptions verifying the consistency of the various copies of MutableCFOptions. (I'm not convinced we need it in SuperVersion in addition to Version, but that's not for here and now.) These checks depend on defaulted `operator==` so depend on C++20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13384
Test Plan:
New unit test in addition to new assertions. SetOptions already tested heavily in crash test. Used
`ROCKSDB_CXX_STANDARD=c++20 make -j100 check` to ensure the new assertions are verified
Reviewed By: cbi42
Differential Revision: D69408829
Pulled By: pdillinger
fbshipit-source-id: 4cf026010c6bb381e0ea27567cce2708d4678e7d