276 Commits

Author SHA1 Message Date
Maciej Szeszko
8e16f8fecf Reduce db stress noise (#13447)
Summary:
[Experiment]

This PR is a followup to https://github.com/facebook/rocksdb/pull/13408. Thick bandaid of ignoring all injected read errors in context of periodic iterator auto refreshes in db stress proved to be effective. We confirmed our theory that errors are not a really a consequence / defect related to this new feature but rather due to subtle ways in which downstream code paths handle their respective IO failures. In this change we're replacing a thick 'ignore all IO read errors' bandaid in `no_batched_ops_stress` with a much smaller, targeted patches in obsolete files purge / delete codepaths, table block cache reader, table cache lookup to make sure we don't miss signal and ensure there's a single mechanism for ignoring error injection in db stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13447

Reviewed By: hx235

Differential Revision: D70794787

Pulled By: mszeszko-meta

fbshipit-source-id: c5fcd4780d82357c407f53bf0bb22fc38f7bd277
2025-03-12 01:13:40 -07:00
Andrew Chang
c1fb33e1d0 Prefetch buffer may not contain all of requested data if EOF is hit (#13376)
Summary:
There was a stress test that failed at the assertion check for `IsDataBlockInBuffer`.

`IsDataBlockInBuffer` is too strict of a condition if we are trying to read past the end of the file.

This seems to be a bug from the original 2019 commit 3737d06adc: 4eb5113091/file/file_prefetch_buffer.cc (L130)

If the caller tries requesting more bytes than are available, then we still return `n` bytes, even if the buffer really only contains `m < n` bytes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13376

Test Plan: I added a unit test which caused the original `IsDataBlockInBuffer ` assertion to fail. I also updated the unit test to check for the result size, which triggered the bug (without this fix) where we return a size of `n` even if less than `n` bytes exist.

Reviewed By: anand1976

Differential Revision: D69269608

Pulled By: archang19

fbshipit-source-id: 1dc0d5930e2b73089850f6e996afbd6192cd5ac8
2025-02-07 13:25:56 -08:00
Peter Dillinger
b3333587eb Clean up some CFOptions code hygiene, fix SetOptions() bug (#13294)
Summary:
To start, I wanted to remove the unnecessary new_options parameter of `InstallSuperVersionAndScheduleWork()`. Passing it something other than the latest mutable options would be inconsistent/outdated. There was even a comment "Use latest MutableCFOptions" on a place that was using the saved options in effect for the compaction.

On investigation, this fixes an undiagnosed but longstanding serious bug in SetOptions() where the new settings can be reverted if a flush or compaction started before the SetOptions() finishes after. Fix confirmed with new unit test in db_test.cc.

I also got tired of seeing the cumbersome usage of pointer rather than const reference for related options accesses, so there's kind of a large (but trivial) refactoring tied in here as well. (Sorry for combining them; wasn't planning a major bug fix)

Intended follow-up: Clarify/simplify the crazy calling conventions of LogAndApply, and remove some unnecessary copying of MutableCFOptions (see new FIXMEs)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13294

Test Plan: test for bug fix, confirmed fails on main and at least as far back as version 8.10. Plus existing tests and CI

Reviewed By: mszeszko-meta

Differential Revision: D68141563

Pulled By: pdillinger

fbshipit-source-id: f6c3290145afa06cc2fe8b485a5de17560a5deea
2025-01-15 13:11:40 -08:00
Andrew Chang
b5e4162582 Unify compaction prefetching logic (#13187)
Summary:
In https://github.com/facebook/rocksdb/pull/13177, I discussed an unsigned integer overflow issue that affects compaction reads inside `FilePrefetchBuffer` when we attempt to enable the file system buffer reuse optimization. In that PR, I disabled the optimization whenever `for_compaction` was `true` to eliminate the source of the bug.

**This PR safely re-enables the optimization when `for_compaction` is `true`.** We need to properly set the overlap buffer through `PrefetchInternal` rather than simply calling `Prefetch`. `Prefetch` assumes `num_buffers_` is 1 (i.e. async IO is disabled), so historically it did not have any overlap buffer logic. What ends up happening (with the old bug) is that, when we try to reuse the file system provided buffer, inside the `Prefetch` method, we read the remaining missing data. However, since we do not do any `RefitTail` method when `use_fs_buffer` is true, normally we would rely on copying the partial relevant data into an overlap buffer. That overlap buffer logic was missing, so the final main buffer ends up storing data from an offset that is greater than the requested offset, and we effectively end up "throwing away" part of the requested data.

**This PR also unifies the prefetching logic for compaction and non-compaction reads:**
- The same readahead size is used. Previously, we read only `std::max(n, readahead_size_)` bytes for compaction reads, rather than `n + readahead_size_` bytes
- The stats for `PREFETCH_HITS` and `PREFETCH_BYTES_USEFUL` are tracked for both. Previously, they were only tracked for non-compaction reads.

These two small changes should help reduce some of the cognitive load required to understand the codebase. The test suite also became easier to maintain. We could not come up with good reasons why the logic for the readahead size and stats should be different for compaction reads.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13187

Test Plan:
I removed the temporary test case from https://github.com/facebook/rocksdb/issues/13200 and incorporated the same test cases into my updated parameterized test case, which tests the valid combinations between `use_async_prefetch` and `for_compaction`.

I went further and added a randomized test case that will simply try to hit `assert`ion failures and catch any missing areas in the logic.

I also added a test case for compaction reads _without_ the file system buffer reuse optimization. I am thinking that it may be valuable to make a future PR that unifies a lot of these prefetch tests and parametrizes as much of them as possible. This way we can avoid writing duplicate tests and just look over different parameters for async IO, direct IO, file system buffer reuse, and `for_compaction`.

Reviewed By: anand1976

Differential Revision: D66903373

Pulled By: archang19

fbshipit-source-id: 351b56abea2f0ec146b83e3d8065ccc69d40405d
2025-01-08 15:22:05 -08:00
Peter Dillinger
c72e79a262 Standardize on clang-format version 18 (#13233)
Summary:
... which is the default for CentOS 9 and Ubuntu 24, the latter of which is now available in GitHub Actions. Relevant CI job updated.

Re-formatted all cc|c|h files except in third-party/, using

```
clang-format -i `git ls-files | grep -E '[.](cc|c|h)$' | grep -v third-party/`
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13233

Test Plan: CI

Reviewed By: jaykorean, archang19

Differential Revision: D67461638

Pulled By: pdillinger

fbshipit-source-id: 0c9ac21a3f5eea6f5ade68bb6af7b6ba16c8b301
2024-12-19 10:58:40 -08:00
Andrew Chang
d957e1a33a Update test implementations for MultiRead with fs_scratch reuse (#13195)
Summary:
This is a follow up to https://github.com/facebook/rocksdb/issues/13189. As mentioned in the description in the previous PR, to guard against similar bugs in the future, we should update our test implementations to reflect the real-world assumptions that we can make about `fs_scratch` when we issue reads with the filesystem buffer reuse optimization. The current test implementations reinforce the misconception that `fs_scratch` points to the same place as `result.data()` (i.e. to the start of the valid data buffer for the read result). `fs_scratch` can point to any arbitrary data structure, but for our purposes, I think we achieve what we want if we just have it point to a `Slice` which wraps the underlying result buffer inside one of its class variables.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13195

Test Plan: Existing unit tests test the same functionality but in an improved way with this change.

Reviewed By: hx235

Differential Revision: D66896380

Pulled By: archang19

fbshipit-source-id: 377e67ec70427716f2b7b7388d99b78003c01eb0
2024-12-17 16:36:04 -08:00
Andrew Chang
6ae3412244 Explain why RandomAccessFileReader* is not passed into FilePrefetchBuffer constructor (#13159)
Summary:
In https://github.com/facebook/rocksdb/pull/13118#discussion_r1842848359, we decided to make a separate follow-up PR that refactors `FilePrefetchBuffer` to determine `use_fs_buffer` once at construction time.

The change would have involved passing in the `RandomAccessFileReader*` directly to the constructor, and using that to determine `use_fs_buffer`. This would avoid repeatedly calling `UseFSBuffer(RandomAccessFileReader* reader)` during the actual prefetch requests.

I started working on this refactoring change but ran into issues with these 2 files, which used `GetOrCreatePrefetchBuffer`
- https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_iterator.cc
- https://github.com/facebook/rocksdb/blob/main/db/merge_helper.cc

As I explained in the added code comments, sometimes the `RandomAccessFileReader*` is not available when we construct the `FilePrefetchBuffer`, so although it is not the most elegant, I think right now it makes sense to pass in the `reader` into the `Prefetch` / `PrefetchAsync` / `TryReadFromCache` calls. Maybe there is a workaround but I don't think the refactor would be worth it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13159

Test Plan: N/A (comments)

Reviewed By: anand1976

Differential Revision: D66473731

Pulled By: archang19

fbshipit-source-id: ce3473694c2cd82513da1a76ad5995afa5bc9cfa
2024-12-17 15:25:04 -08:00
Andrew Chang
5aead7af3d Pass through use_fs_buffer to Read method (#13200)
Summary:
This is a follow up to https://github.com/facebook/rocksdb/issues/13177, which was supposed to disable the file system buffer optimization for compaction reads. However, it did not work as expected because I did not pass through `use_fs_buffer` to the `Read` method, which also calls `UseFSBuffer`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13200

Test Plan:
I added simple tests to verify we do not hit the overflow issue when we are doing compaction prefetches.

```
 ./prefetch_test --gtest_filter="*FSBufferPrefetchForCompaction*"
```

Of course I will be looking through the warm storage crash test logs as well once the change is merged.

Reviewed By: anand1976

Differential Revision: D66996079

Pulled By: archang19

fbshipit-source-id: b4d9254f1354ccfc53a307174de5f2388b7e5474
2024-12-09 17:58:55 -08:00
Andrew Chang
d386385e0b Temporarily disable file system buffer reuse optimization for compaction prefetches (#13177)
Summary:
https://github.com/facebook/rocksdb/issues/13182 successfully fixed the heap `use-after-free` issue.

However, there was one additional error I found while looking through the warm storage crash test logs. There are repeated (though infrequent) unsigned pointer arithmetic overflow errors that look like this:
```cpp
file_prefetch_buffer.cc:860:46: runtime error: addition of unsigned offset to 0x7f282001880f overflowed to 0x7f2820017667
```

It took me a while to figure it out, but I was finally able to reproduce the issue locally. It turns out the issue is when we call `TryReadFromCache` with `for_compaction` set to `true`. The default value for `for_compaction` is `false`, and this was not covered in the unit tests written for https://github.com/facebook/rocksdb/issues/13118.

When I run the same unit tests with `for_compaction` set to `true`, I am able to break this assertion that I added at the end of `TryReadFromCacheUntracked`:
```cpp
assert(buf->offset_ <= offset);
```

If `buf->offset_` is greater than `offset`, then that explains the overflow we get in the following lines:
```cpp
uint64_t offset_in_buffer = offset - buf->offset_;
*result = Slice(buf->buffer_.BufferStart() + offset_in_buffer, n);
```

I will have another PR out that fixes the issue and enables the optimization when `for_compaction` is set to `true`. I will need to add some overlap buffer logic, similar to what I have inside `PrefetchInternal`. For now, since I have confirmed that there is indeed a bug, we should disable the optimization where needed. It will take me some time to implement the fix and write new test cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13177

Test Plan: I kept the existing unit tests which test the file system buffer reuse code when `for_compaction` is `false`. I expect that the warm storage crash test logs will no longer show the integer overflow issue once we merge this PR.

Reviewed By: anand1976

Differential Revision: D66721857

Pulled By: archang19

fbshipit-source-id: 22d523646f969a7a0ccbbea73f63c32601f1179a
2024-12-09 13:24:16 -08:00
Andrew Chang
debd87596a Potential fix for heap use-after-free issue (#13182)
Summary:
This PR is an attempt to address https://github.com/facebook/rocksdb/issues/13118. The warm storage crash tests show use-after-free errors. They do not occur in every single crash test run, but with enough attempts, they are repeatable.

Theory 1:
I am wondering if the `fs_buffer` is being prematurely freed before we take ownership of it. In `SetBuffer`, I was passing in `FSAllocationPtr&& new_buf` rather than `FSAllocationPtr new_buf`. When I pass the parameter as `FSAllocationPtr&& new_buf`, only after the `buf_ = std::move(new_buf);` line is run is ownership transferred from the original `FSAllocationPtr`. But before that I had a line `bufstart_ = reinterpret_cast<char*>(buf_.get());`. So I am hypothesizing that it is possible, under certain race conditions, that between the first `buf_.get()` and the `buf_ = std::move(new_buf);`, the `fs_buffer` was altered, leaving `bufstart_` pointing to some freed memory area.

Theory 2 (from anand1976):
Perhaps we need to set the `bufstart_` based on the `Slice` rather than the `FSAllocationPtr`. This would be more consistent with what we do here https://github.com/facebook/rocksdb/blob/main/table/block_fetcher.cc#L275.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13182

Test Plan: The existing unit tests and CI ensures I am not making anything worse, but I will want to wait and see if the daily crash tests runs still have the same `heap-use-after-free` errors with this change. Alternatively, if we fail the `assert` I just added, then I can make a follow-up PR to return `false` from `TryReadFromCache` whenever we get handed back a `nullptr`.

Reviewed By: anand1976

Differential Revision: D66771852

Pulled By: archang19

fbshipit-source-id: 5b585d86d657ec050a04e892d3b1cf4383f377f9
2024-12-05 08:36:41 -08:00
anand76
a1be80c5c2 Try to align WritableFileWriter buffered writes (#13158)
Summary:
In buffered IO mode, without checksum calculation for buffered data enabled, try to align writes to the file system on a power of two. This can improve performance, especially on a distributed file system like Warm Storage that does erasure coding and benefits from full stripe writes. We do this by filling up the writable buffer, with a partial append if necessary, before flushing. When checksum calculation for buffered data is enabled, we don't do this since its preferable to not split the data, especially if the caller provides the checksum. We don't guarantee alignment if the caller manually flushes before finishing the file.

Tests:
Add unit tests in file_reader_writer_test and external_sst_file_basic_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13158

Reviewed By: pdillinger

Differential Revision: D66669367

Pulled By: anand1976

fbshipit-source-id: 6df1b4538bda696e2170515420ee4c3766c83bb8
2024-12-03 10:41:24 -08:00
Andrew Chang
26b480609c Update FilePrefetchBuffer::Read to reuse file system buffer when possible (#13118)
Summary:
This PR adds support for reusing the file system provided buffer to avoid an extra `memcpy` into RockDB's buffer. This optimization has already been implemented for point lookups, as well as compaction and scan reads _when prefetching is disabled_.

This PR extends this optimization to work with synchronous prefetching (`num_buffers == 1`). Asynchronous prefetching can be addressed in a future PR (and probably should be to keep this PR from growing too large).

Remarks
- To handle the case where the main buffer only has part of the requested data, I used the existing `overlap_buf_` (currently used in the async prefetching case) instead of defining a separate buffer. This was discussed in https://github.com/facebook/rocksdb/pull/13118#discussion_r1842839360.
- We use `MultiRead` with a single request to take advantage of the file system buffer. This is consistent with previous work (e.g. https://github.com/facebook/rocksdb/pull/12266).
- Even without the tests I added, there was some code coverage inside in at least `DBIOCorruptionTest.IterReadCorruptionRetry`, since those tests were failing before I addressed a bug in my code for this PR. [Run with failed test](https://github.com/facebook/rocksdb/actions/runs/11708830448/job/32611508818?pr=13118).
- This prefetching code is not too easy to follow, so I added quite a bit of comments to both the code and test case to try to make it easier to understand the exact internal state of the prefetch buffer at every point in time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13118

Test Plan:
I wrote pretty thorough unit tests that cover synchronous prefetching with file system buffer reuse.  The flows for partial hits, complete hits, and complete misses are tested. I also parametrized the test to make sure the async prefetching (without file system buffer reuse) still work as expected.

Once we agree on the changes, I will run a long stress test before merging.

Reviewed By: anand1976

Differential Revision: D65559101

Pulled By: archang19

fbshipit-source-id: 1a56d846e918c20a009b83f1371c1791f69849ae
2024-11-21 12:32:13 -08:00
Peter Dillinger
3fd1f11d35 Fix race to make BlockBasedTableOptions effectively mutable (#13082)
Summary:
Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR:
* Adds a virtual `Clone()` function to TableFactory
* To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats`
* `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types).

For future follow-up:
* Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling)

Fixes https://github.com/facebook/rocksdb/issues/10079

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082

Test Plan:
added to unit tests and crash test

Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079)

Reviewed By: cbi42

Differential Revision: D64947243

Pulled By: pdillinger

fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8
2024-10-25 10:24:54 -07:00
Hui Xiao
58fc9d61b7 Trim readahead size based on prefix during prefix scan (#13040)
Summary:
**Context/Summary:**
During prefix scan, prefetched data blocks containing keys not in the same prefix as the `Seek()`'s key will be wasted when `ReadOptions::prefix_same_as_start = true` since they won't be returned to the user. This PR is to exclude those data blocks from being prefetched in a similar manner like trimming according to `ReadOptions::iterate_upper_bound`.

Bonus: refactoring to some existing prefetch test so they are easier to extend and read

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13040

Test Plan:
- New UT, integration to existing UTs
- Benchmark to ensure no regression from CPU due to more trimming logic
```
// Build DB with one sorted run under the same prefix
./db_bench --benchmarks=fillrandom --prefix_size=3 --keys_per_prefix=5000000 --num=5000000 --db=/dev/shm/db_bench --disable_auto_compactions=1
```
```
// Augment the existing db bench to call `Seek()` instead of `SeekToFirst()` in `void ReadSequential(){..}` to trigger the logic in this PR

+++ b/tools/db_bench_tool.cc
@@ -5900,7 +5900,12 @@ class Benchmark {
     Iterator* iter = db->NewIterator(options);
     int64_t i = 0;
     int64_t bytes = 0;
-    for (iter->SeekToFirst(); i < reads_ && iter->Valid(); iter->Next()) {
+
+    iter->SeekToFirst();
+    assert(iter->status().ok() && iter->Valid());
+    auto prefix = prefix_extractor_->Transform(iter->key());
+
+    for (iter->Seek(prefix); i < reads_ && iter->Valid(); iter->Next()) {
       bytes += iter->key().size() + iter->value().size();
       thread->stats.FinishedOps(nullptr, db, 1, kRead);
       ++i;
:
```
```
// Compare prefix scan performance
./db_bench --benchmarks=readseq[-X20] --prefix_size=3  --prefix_same_as_start=1 --auto_readahead_size=1 --cache_size=1 --use_existing_db=1 --db=/dev/shm/db_bench --disable_auto_compactions=1

// Before PR
readseq [AVG    20 runs] : 2449011 (± 50238) ops/sec;  270.9 (± 5.6) MB/sec
readseq [MEDIAN 20 runs] : 2499167 ops/sec;  276.5 MB/sec

// After PR  (regress 0.4 %)
readseq [AVG    20 runs] : 2439098 (± 42931) ops/sec;  269.8 (± 4.7) MB/sec
readseq [MEDIAN 20 runs] : 2460859 ops/sec;  272.2 MB/sec

```

- Stress test: randomly set `prefix_same_as_start` in `TestPrefixScan()`. Run below for a while
```
python3 tools/db_crashtest.py --simple blackbox --prefix_size=5 --prefixpercent=65 --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=-1 --bloom_bits=3 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_style=2 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=8388608 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=0 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32  --readpercent=10 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=10
```

Reviewed By: anand1976

Differential Revision: D64367065

Pulled By: hx235

fbshipit-source-id: 5750c05ccc835c3e9dc81c961b76deaf30bd23c2
2024-10-17 15:52:55 -07:00
Nick Brekhus
0611eb5b9d Fix orphaned files in SstFileManager (#13015)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015

`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.

Reviewed By: jowlyzhang

Differential Revision: D62590773

fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
2024-09-18 13:27:44 -07:00
Nick Brekhus
40adb2bab7 Fix wraparound in SstFileManager (#13010)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13010

The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name.

Reviewed By: anand1976

Differential Revision: D62535775

fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d
2024-09-13 14:39:48 -07:00
Peter Dillinger
96340dbce2 Options for file temperature for more files (#12957)
Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957

Test Plan: Unit test included, though it's hard to be sure we've covered all the places

Reviewed By: jowlyzhang

Differential Revision: D61664815

Pulled By: pdillinger

fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
2024-08-23 19:49:25 -07:00
Yu Zhang
d12aaf23ca Fix file deletions in DestroyDB not rate limited (#12891)
Summary:
Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`.

It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it.  This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891

Test Plan: Added unit tests, existing tests.

Reviewed By: anand1976

Differential Revision: D60300220

Pulled By: jowlyzhang

fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7
2024-08-02 19:31:55 -07:00
Hui Xiao
408e8d4c85 Handle injected write error after successful WAL write in crash test + misc (#12838)
Summary:
**Context/Summary:**
We discovered the following false positive in our crash test lately:
(1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL
(2) Current stress test logic will rollback the expected state of such k/v since PUT() fails
(3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed.

We decided to leave those expected state to be pending until the loop-write of the same key succeeds.

Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838

Test Plan:
Rehearsal CI
Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine

```
python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40
```

Reviewed By: cbi42

Differential Revision: D59377075

Pulled By: hx235

fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6
2024-07-29 13:51:49 -07:00
Changyu Bi
b800b5eb6a Deflake ThreadStatus related unit tests (#12858)
Summary:
Unit tests `DBTest.ThreadStatusFlush` and `DBTestWithParam.ThreadStatusSingleCompaction` have been flaky and fail with error message
```
[ RUN      ] DBTest.ThreadStatusFlush
op_count: 0, expected_count 1
thread id: 718113, thread status: , cf_name
thread id: 718114, thread status: , cf_name pikachu
/__w/rocksdb/rocksdb/db/db_test.cc:4817: Failure
Value of: VerifyOperationCount(env_, ThreadStatus::OP_FLUSH, 1)
  Actual: false
Expected: true
[  FAILED  ] DBTest.ThreadStatusFlush (106 ms)

[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
db/db_test.cc:4673: Failure
Expected equality of these values:
  op_count
    Which is: 0
  expected_count
    Which is: 1
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false)
```

One cause for this is that before flush/compaction finishes, we will go through `~WritableFileWriter()`, either for WAL or SST file, and temporarily set thread_operation to UNKNOWN. This UNKNOWN thread operation seem to be there for some stress test verification. This PR fixes these tests by setting the IOActivity in ~WritableFileWriter() for debug build.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12858

Test Plan: monitor future test failure.

Reviewed By: hx235

Differential Revision: D59691564

Pulled By: cbi42

fbshipit-source-id: 3f96998bba9d42aba50d1830c2b51bef2dd6705f
2024-07-15 09:56:09 -07:00
Hui Xiao
84296bc248 Reset seen_injected_error_ with seen_error_ (#12830)
Summary:
**Context/Summary** : as titled as seen_injected_error_ is a subcategory of seen_error_

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12830

Test Plan: existing CI as it only affects crash test code

Reviewed By: jaykorean

Differential Revision: D59249018

Pulled By: hx235

fbshipit-source-id: 20e4c22cade57e12a104a03999e4c841a3648b11
2024-07-01 16:40:57 -07:00
Peter Dillinger
efba8f5b27 Respect ReadOptions::read_tier in prefetching (#12782)
Summary:
a pre-existing flaw revealed by crash test with uncache behavior. Easy fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12782

Test Plan: Modified unit test PrefetchTest.Basic (fails without fix)

Reviewed By: hx235

Differential Revision: D58757916

Pulled By: pdillinger

fbshipit-source-id: 23c0240c7cf0cb0b69a372f9531c07af920e09da
2024-06-19 09:53:59 -07:00
Hui Xiao
1adb935720 Inject more errors to more files in stress test (#12713)
Summary:
**Context:**
We currently have partial error injection:
- DB operation: all read, SST write
- DB open: all read, SST write, all metadata write.

This PR completes the error injection (with some limitations below):
- DB operation & open: all read, all write, all metadata write, all metadata read

**Summary:**
- Inject retryable metadata read, metadata write error concerning directory (e.g, dir sync, ) or file metadata (e.g, name, size, file creation/deletion...)
- Inject retryable errors to all major file types: random access file, sequential file, writable file
- Allow db stress test operations to handle above injected errors gracefully without crashing
- Change all error injection to thread-local implementation for easier disabling and enabling in the same thread. For example, we can control error handling thread to have no error injection. It's also cleaner in code.
   - Limitation: compared to before, we now don't have write fault injection for backup/restore CopyOrCreateFiles work threads since they use anonymous background threads as well as read injection for db open bg thread
- Add a new flag to test error recovery without error injection so we can test the path where error recovery actually succeeds
- Some Refactory & fix to db stress test framework (see PR review comments)
- Fix some minor bugs surfaced (see PR review comments)
- Limitation: had to disable backup restore with metadata read/write injection since it surfaces too many testing issues. Will add it back later to focus on surfacing actual code/internal bugs first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12713

Test Plan:
- Existing UT
- CI with no trivial error failure

Reviewed By: pdillinger

Differential Revision: D58326608

Pulled By: hx235

fbshipit-source-id: 011b5195aaeb6011641ae0a9194f7f2a0e325ad7
2024-06-19 08:42:00 -07:00
Peter Dillinger
0646ec6e2d Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.

Stacked on https://github.com/facebook/rocksdb/issues/12731

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 11:48:45 -07:00
Peter Dillinger
98393f0139 Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See https://github.com/facebook/rocksdb/issues/10185

Stacked on https://github.com/facebook/rocksdb/issues/12729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
2024-06-05 17:40:09 -07:00
Peter Dillinger
9f4c597d83 FaultInjectionTestFS read unsynced data by default (#12729)
Summary:
In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.)

This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues.

Also removes unused and confusing field `pos_at_last_flush_`

This change is needed to support testing a relevant bug fix (in a follow-up diff).  Other suggested follow-up:
* Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing.
* Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items).
* Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729

Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test

Reviewed By: hx235

Differential Revision: D58091835

Pulled By: pdillinger

fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33
2024-06-04 15:25:23 -07:00
Hui Xiao
d7b938882e Sync WAL during db Close() (#12556)
Summary:
**Context/Summary:**
Below crash test found out we don't sync WAL upon DB close, which can lead to unsynced data loss. This PR syncs it.
```
./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100

 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound:
Verification failed :(
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12556

Test Plan:
- New UT
- Same stress test command failed before this fix but pass after
- CI

Reviewed By: ajkr

Differential Revision: D56267964

Pulled By: hx235

fbshipit-source-id: af1b7e8769c129f64ba1c7f1ff17102f1239b929
2024-05-20 17:33:43 -07:00
Yu Zhang
241253053a Fix delete obsolete files on recovery not rate limited (#12590)
Summary:
This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited.

The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126) however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1.

In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.

There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590

Test Plan: Added unit test and verified with manual testing

Reviewed By: anand1976

Differential Revision: D56830519

Pulled By: jowlyzhang

fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
2024-05-01 12:26:54 -07:00
Hui Xiao
abd6751aba Fix wrong padded bytes being used to generate file checksum (#12598)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598

Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption:  ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 421334 (± 4126) ops/sec;   46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG    300 runs] : 425768 (± 4309) ops/sec;   47.1 (± 0.5) MB/sec

Reviewed By: ajkr, anand1976

Differential Revision: D56725688

Pulled By: hx235

fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
2024-04-30 15:38:53 -07:00
Hui Xiao
7d83b4e3e5 Fix file checksum mismatch due to padded bytes when block_align=true (#12542)
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.

We decided to include those padded bytes in generating the checksum upon file creation.

Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542

Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 422857 (± 3942) ops/sec;   46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG    300 runs] : 424707 (± 3799) ops/sec;   47.0 (± 0.4) MB/sec

Reviewed By: ajkr

Differential Revision: D56168447

Pulled By: hx235

fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
2024-04-22 14:07:34 -07:00
anand76
97991960e9 Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518)
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but its a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518

Reviewed By: ajkr

Differential Revision: D55932155

Pulled By: anand1976

fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
2024-04-18 17:36:33 -07:00
Peter Dillinger
a53ed91691 Fix/improve temperature handling for file ingestion (#12402)
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388

In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost"

In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.

Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402

Test Plan:
Deeply revamped
ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.

Used macros instead of functions for better tracing to critical source location on test failures.

Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.

Reviewed By: jowlyzhang

Differential Revision: D54442794

Pulled By: pdillinger

fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
2024-03-05 16:56:08 -08:00
Peter Dillinger
13ef21c22e default_write_temperature option (#12388)
Summary:
Currently SST files that aren't applicable to last_level_temperature nor file_temperature_age_thresholds are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats.

This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply.

Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction.

Needed follow-up: Fix temperature handling with external file ingestion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388

Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.)

Reviewed By: jowlyzhang

Differential Revision: D54266574

Pulled By: pdillinger

fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb
2024-02-28 14:36:13 -08:00
Akanksha Mahajan
956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
Peter Dillinger
54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Peter Dillinger
76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
akankshamahajan
b9cb7b9644 Provide support for FSBuffer for point lookups (#12266)
Summary:
Provide support for FSBuffer for point lookups

It also add support for compaction and scan reads that goes through BlockFetcher when readahead/prefetching is not enabled.

Some of the compaction/Scan reads goes through FilePrefetchBuffer and some through BlockFetcher. This PR add support to use underlying file system scratch buffer for reads that go through BlockFetcher as for FilePrefetch reads, design is complicated to support this feature.

Design - In order to use underlying FileSystem provided scratch for Reads, it uses MultiRead with 1 request instead of Read API which required API change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266

Test Plan: Stress test using underlying file system  scratch buffer internally.

Reviewed By: anand1976

Differential Revision: D53019089

Pulled By: akankshamahajan15

fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961
2024-01-29 15:08:20 -08:00
Peter Dillinger
4e60663b31 Remove unnecessary, confusing 'extern' (#12300)
Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`

For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300

Test Plan: no functional changes, CI

Reviewed By: ajkr

Differential Revision: D53148629

Pulled By: pdillinger

fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
2024-01-29 10:38:08 -08:00
Hui Xiao
a31fded253 Pass rate_limiter_priority from SequentialFileReader to FS (#12296)
Summary:
**Context/Summary:**
The rate_limiter_priority passed to SequentialFileReader is now passed down to underlying file system. This allows the priority associated with backup/restore SST reads to be exposed to FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12296

Test Plan: - Modified existing UT

Reviewed By: pdillinger

Differential Revision: D53100368

Pulled By: hx235

fbshipit-source-id: b4a28917efbb1b0d16f9d1c2b38769bffcff0f34
2024-01-25 18:20:31 -08:00
akankshamahajan
b5bb553d5e Fix PREFETCH_BYTES_USEFUL stat calculation (#12251)
Summary:
After refactoring of FilePrefetchBuffer, PREFETCH_BYTES_USEFUL was miscalculated. Instead of calculating how many requested bytes are already in the buffer, it took into account alignment as well because aligned_useful_len takes into consideration alignment too.

Also refactored the naming of chunk_offset_in_buffer to make it similar to aligned_useful_len

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12251

Test Plan:
1. Validated internally through release validation benchmarks.
2. Updated unit test that fails without the fix.

Reviewed By: ajkr

Differential Revision: D52891112

Pulled By: akankshamahajan15

fbshipit-source-id: 2526a0b0572d473beaf8b841f2f9c2f6275d9779
2024-01-18 19:09:49 -08:00
akankshamahajan
1de6940980 Fix heap use after free error in FilePrefetchBuffer (#12211)
Summary:
Fix heap use after free error in FilePrefetchBuffer
Fix heap use after free error in FilePrefetchBuffer

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12211

Test Plan:
Ran db_stress in ASAN mode
```
==652957==ERROR: AddressSanitizer: heap-use-after-free on address 0x6150006d8578 at pc 0x7f91f74ae85b bp 0x7f91c25f90c0 sp 0x7f91c25f90b8
READ of size 8 at 0x6150006d8578 thread T48
    #0 0x7f91f74ae85a in void __gnu_cxx::new_allocator<rocksdb::BufferInfo*>::construct<rocksdb::BufferInfo*, rocksdb::BufferInfo*&>(rocksdb::BufferInfo**, rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/ext/new_allocator.h:163
    https://github.com/facebook/rocksdb/issues/1 0x7f91f74ae85a in void std::allocator_traits<std::allocator<rocksdb::BufferInfo*> >::construct<rocksdb::BufferInfo*, rocksdb::BufferInfo*&>(std::allocator<rocksdb::BufferInfo*>&, rocksdb::BufferInfo**, rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/alloc_traits.h:512
    https://github.com/facebook/rocksdb/issues/2 0x7f91f74ae85a in rocksdb::BufferInfo*& std::deque<rocksdb::BufferInfo*, std::allocator<rocksdb::BufferInfo*> >::emplace_back<rocksdb::BufferInfo*&>(rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/deque.tcc:170
    https://github.com/facebook/rocksdb/issues/3 0x7f91f74b93d8 in rocksdb::FilePrefetchBuffer::FreeAllBuffers() file/file_prefetch_buffer.h:557
```

Reviewed By: ajkr

Differential Revision: D52575217

Pulled By: akankshamahajan15

fbshipit-source-id: 6811ec10a393f5a62fedaff0fab5fd6e823c2687
2024-01-05 18:10:58 -08:00
akankshamahajan
5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Peter Dillinger
ed46981bea Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206)
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.

In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
  * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.

Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206

Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200

Reviewed By: akankshamahajan15

Differential Revision: D52551701

Pulled By: pdillinger

fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
2024-01-04 18:39:05 -08:00
Hui Xiao
06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
akankshamahajan
e7c6259447 Make auto_readahead_size default true (#12080)
Summary:
Make auto_readahead_size option default true

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080

Test Plan: benchmarks and exisiting tests

Reviewed By: anand1976

Differential Revision: D52152132

Pulled By: akankshamahajan15

fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44
2023-12-14 11:25:51 -08:00
akankshamahajan
c77b50a4fd Add AsyncIO support for tuning readahead_size by block cache lookup (#11936)
Summary:
Add support for tuning of readahead_size by block cache lookup for async_io.

**Design/ Implementation** -

**BlockBasedTableIterator.cc** -

`BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start
and end offset passed. This function looks into the block cache for the blocks between `start_offset`
and `end_offset` and add all the handles in the queue.

It then iterates from the end in the handles to find first miss block and update the end offset to that block.
It also iterates from the start and find first miss block and update the start offset to that block.

```
_read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block
                             synchronously.
                             False if current call is to prefetch additional data in extra buffers
                            (due to ReadAsync call in FilePrefetchBuffer)
```
In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache),
it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read.

**FilePrefetchBuffer.cc** -

FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that
callback to get updated start and end offset to read based on cache hits/misses.

1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true.
2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching,  the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss).

Foreg. if following are the data blocks with cache hit/miss and start offset
and Read API found miss on DB1 and based on readahead_size (50)  it passes end offset to be 50.
 [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40]
 [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90]

- For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache.
- Read calls saves initial end offset 50 as that was meant to be prefetched.
- Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100
- On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8).
- And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset.
-  `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer.

- If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936

Test Plan:
- Ran crash_test several times.
-  New unit tests added.

Reviewed By: anand1976

Differential Revision: D50906217

Pulled By: akankshamahajan15

fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf
2023-12-06 13:48:15 -08:00
Yu Zhang
ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
anand76
e81393e81e Add some stats to observe the usefulness of scan prefetching (#11981)
Summary:
Add stats for better observability of scan prefetching. Its only implemented for sync scan right now. These stats can help inform future improvements in scan prefetching.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11981

Test Plan: Add a new unit test

Reviewed By: akankshamahajan15

Differential Revision: D50516505

Pulled By: anand1976

fbshipit-source-id: cb1cc6cf02df8295930a49c62b11870020df3f97
2023-10-23 14:42:44 -07:00
Changyu Bi
d5bc30befa Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.

The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.

Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975

Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.

Reviewed By: ajkr

Differential Revision: D50383790

Pulled By: cbi42

fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
2023-10-18 09:38:38 -07:00
Akanksha Mahajan
018eede679 Remove assertion from PrefetchAsync (#11965)
Summary:
Remove assertion from PrefetchAsync (roundup_len2 >= alignment) as for non direct_io, buffer size can be less than alignment resulting in assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11965

Test Plan: Ran the issue causing db_stress without this assertion and the verification completes successfully.

Reviewed By: anand1976

Differential Revision: D50328955

Pulled By: akankshamahajan15

fbshipit-source-id: 65f55ca230d2bbc63f4e2cc34c7273b22b515879
2023-10-16 15:14:58 -07:00