5064 Commits

Hui Xiao
f79b3d19a7 Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291)
Summary:
**Context/Summary:**
Previous experience with bugs and flaky tests taught us that there are features in RocksDB vulnerable to race conditions caused by acquiring the db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeups and sleeps so that the db mutex is acquired at various timings, in order to surface such race conditions.
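
For illustration only (not the actual injection hook in this patch), a minimal sketch of the idea: before acquiring the "db mutex", sleep for a random short interval and issue a spurious `notify_all()` so that waiters wake up and the mutex is acquired at shifted timings. The helper names here are hypothetical.
```
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <random>
#include <thread>

// Perturb timing: random sleep plus a spurious wakeup for any condition
// variable waiters, making latent mutex-timing races more likely to surface
// under TSAN / stress tests.
void CoerceContextSwitch(std::condition_variable& cv) {
  thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> dist(0, 1000);
  std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));
  cv.notify_all();  // spurious wakeup for waiters
}

void LockDbMutexWithInjection(std::mutex& mu, std::condition_variable& cv) {
  CoerceContextSwitch(cv);  // injected perturbation
  mu.lock();                // acquire the mutex at a "shifted" time
}
```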

**Testing:**
- `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
    - flaky tests caused by db mutex related race condition
       - Reverted https://github.com/facebook/rocksdb/pull/9528
       - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
       - `COERCE_CONTEXT_SWITCH=1` can cause the expected test failure (i.e., expose the target TSAN data race error) within 10 runs, while the other configuration could not.
       - This proves our injection can expose flaky tests caused by db-mutex-related race conditions faster.
    -  known or new race-condition-type of internal bug by continuously running this PR
- Performance
   - High ops-threads time: COERCE_CONTEXT_SWITCH=1 was roughly 4 times slower (2:01.16 vs 0:22.10 elapsed). This PR will be run as a separate CI job so this regression won't affect any existing job.
```
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 --target_file_size_base=524288 --ingest_external_file_one_in=100 --compact_files_one_in=1000 --compact_range_one_in=1000
```
  - Start-up time: COERCE_CONTEXT_SWITCH=1 regressed by less than 25% (0:01.51 vs 0:01.29 elapsed)
```
TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress

TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291

Reviewed By: ajkr

Differential Revision: D39231182

Pulled By: hx235

fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
2022-09-12 13:55:23 -07:00
Yanqin Jin
3d67d79154 Fix overlapping check by excluding timestamp (#10615)
Summary:
With user-defined timestamps, the overlap check should exclude the
timestamp part of the key. This has already been done for range checking
of files in sstableKeyCompare(), but not yet when checking against
concurrent compactions.
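
As a sketch of the idea (hypothetical helpers, not the patch's code): with user-defined timestamps the last `ts_sz` bytes of a user key are the timestamp, so an overlap check should compare only the stripped prefix. `StripTimestamp`, `RangesOverlapIgnoringTimestamp`, and `ucmp_no_ts` (a comparator assumed to compare timestamp-less keys) are illustrative names.
```
#include <cassert>

#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Drop the trailing timestamp bytes from a user key.
inline rocksdb::Slice StripTimestamp(const rocksdb::Slice& user_key,
                                     size_t ts_sz) {
  assert(user_key.size() >= ts_sz);
  return rocksdb::Slice(user_key.data(), user_key.size() - ts_sz);
}

// Overlap check between [a_start, a_end] and [b_start, b_end] that ignores
// the timestamp, so keys differing only in timestamp still count as overlap.
inline bool RangesOverlapIgnoringTimestamp(
    const rocksdb::Comparator* ucmp_no_ts, const rocksdb::Slice& a_start,
    const rocksdb::Slice& a_end, const rocksdb::Slice& b_start,
    const rocksdb::Slice& b_end, size_t ts_sz) {
  return ucmp_no_ts->Compare(StripTimestamp(a_start, ts_sz),
                             StripTimestamp(b_end, ts_sz)) <= 0 &&
         ucmp_no_ts->Compare(StripTimestamp(b_start, ts_sz),
                             StripTimestamp(a_end, ts_sz)) <= 0;
}
```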

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615

Test Plan:
(Will add more tests)

make check
(Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a)
rm -rf /dev/shm/rocksdb/* &&
mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected &&
./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1

Reviewed By: akankshamahajan15

Differential Revision: D39146797

Pulled By: riversand963

fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 13:03:07 -07:00
Levi Tamasi
fe56cb9aa0 Eliminate some allocations/copies around the blob cache (#10647)
Summary:
Historically, `BlobFileReader` has returned the blob(s) read from the file
in the `PinnableSlice` provided by the client. This interface was
preserved when caching was implemented for blobs, which meant that
the blob data was copied multiple times when caching was in use: first,
into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
The patch eliminates these copies and the related allocations by changing
`BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
objects that can be directly inserted into the cache. The allocations backing
these `BlobContents` objects are made using the blob cache's allocator if the
blobs are to be inserted into the cache (i.e. if a cache is configured and the
`fill_cache` read option is set). Note: this PR focuses on the common case when
blobs are compressed; some further small optimizations are possible for uncompressed
blobs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39335185

Pulled By: ltamasi

fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
2022-09-08 12:40:18 -07:00
Peter Dillinger
6de7081cf3 Always verify SST unique IDs on SST file open (#10532)
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.

One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
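
A minimal usage sketch, assuming the `DBOptions::verify_sst_unique_id_in_manifest` field available in recent releases (on by default after this change, but it can still be set explicitly):
```
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Verify each SST file's unique ID against the manifest whenever the file is
// opened through the table cache (no longer a separate pass at DB::Open).
rocksdb::Status OpenWithUniqueIdCheck(const std::string& db_path,
                                      rocksdb::DB** db) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.verify_sst_unique_id_in_manifest = true;
  return rocksdb::DB::Open(options, db_path, db);
}
```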

Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`

Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532

Test Plan:
updated unit tests

Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec

Reviewed By: jay-zhuang

Differential Revision: D38765551

Pulled By: pdillinger

fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
2022-09-07 22:52:42 -07:00
Bo Wang
d490bfcdb6 Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary:
**Summary:**
When a block is first looked up (`Lookup`) from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don't erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again do we erase it from the secondary cache and add it into the primary cache.

When a block is first evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.

**Implementation Details**
Add a new state of LRUHandle: the handle is never inserted into the LRUCache (neither the hash table nor the LRU list) and it doesn't go through the usual LRUCache handle states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true)

The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
  1.1. If the handle's value is not nullptr, it is returned immediately.
  1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
    - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
    - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
  2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
  2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.

The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
  1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
  1.2. Insert the item into the primary cache and return the handle to caller.
  1.3. Exception handling.
2. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.

The behavior of  `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
2. If a dummy item is found for a block, the block is inserted into the secondary cache.

The behavior of `CompressedSecondaryCache::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.

The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.
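
For illustration (not part of this patch), a minimal configuration sketch of the setup whose admission policy is described above, assuming the RocksDB 7.x APIs `NewCompressedSecondaryCache` and `LRUCacheOptions::secondary_cache`:
```
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Wire a CompressedSecondaryCache behind an LRU block cache.
rocksdb::Options MakeOptionsWithCompressedSecondaryCache() {
  rocksdb::CompressedSecondaryCacheOptions sec_opts;
  sec_opts.capacity = 256 << 20;  // 256 MB of compressed, "colder" blocks

  rocksdb::LRUCacheOptions primary_opts;
  primary_opts.capacity = 1 << 30;  // 1 GB primary (uncompressed) cache
  primary_opts.secondary_cache = rocksdb::NewCompressedSecondaryCache(sec_opts);

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(primary_opts);

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```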

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527

Test Plan:
1. stress tests.
2. unit tests.
3. CPU profiling for db_bench.

Reviewed By: siying

Differential Revision: D38747613

Pulled By: gitbw95

fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
2022-09-07 19:00:27 -07:00
Levi Tamasi
c8543296ca Support custom allocators for the blob cache (#10628)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10628

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D39228165

Pulled By: ltamasi

fbshipit-source-id: 591fdff08db400b170b26f0165551f86d33c1dbf
2022-09-06 13:31:48 -07:00
Andrew Kryczka
5a97e6b1d2 Deflake blob caching tests (#10636)
Summary:
Example failure:

```
db/blob/db_blob_basic_test.cc:226: Failure
Expected equality of these values:
  i
    Which is: 1
  num_blobs
    Which is: 5
```

I can't repro locally, but it looks like the 2KB cache is too small to guarantee no eviction happens between loading all the data into cache and reading from `kBlockCacheTier`. This 2KB setting appears to have come from a test where the cached entries are pinned, where it makes sense to have a small setting. However, such a small setting makes less sense when the blocks are evictable but must remain cached per the test's expectation. This PR increases the capacity setting to 2MB for those cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10636

Reviewed By: cbi42

Differential Revision: D39250976

Pulled By: ajkr

fbshipit-source-id: 769309f9a19cfac20b67b927805c8df5c1d2d1f5
2022-09-06 13:01:05 -07:00
Andrew Kryczka
1ffadbe9fc Deflake DBErrorHandlingFSTest.*WALWriteError (#10642)
Summary:
Example flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/17660/workflows/7a891875-f07b-4a67-b204-eaa7ca9f9aa2/jobs/467496

The test could get stuck in out-of-space because a callback executed `SetFilesystemActive(false /* active */)` after the test executed `SetFilesystemActive(true /* active */)`. This could happen because background info logging went through the SyncPoint callback "WritableFileWriter::Append:BeforePrepareWrite", probably unintentionally. The solution in this PR is to call `ClearAllCallBacks()` to wait for any such pending callbacks to drain before calling `SetFilesystemActive(true /* active */)`.
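
A sketch of the test-side pattern involved here, assuming the internal test-only headers `test_util/sync_point.h` and `utilities/fault_injection_fs.h` (names taken from the test code, not a public API):
```
#include "test_util/sync_point.h"
#include "utilities/fault_injection_fs.h"

// Disarm error injection safely: clearing callbacks waits for in-flight
// callbacks (e.g. the one registered on
// "WritableFileWriter::Append:BeforePrepareWrite") to drain, so no stale
// callback can flip the filesystem back to inactive afterwards.
void DisarmErrorInjection(rocksdb::FaultInjectionTestFS* fault_fs) {
  rocksdb::SyncPoint::GetInstance()->DisableProcessing();
  rocksdb::SyncPoint::GetInstance()->ClearAllCallBacks();
  fault_fs->SetFilesystemActive(true /* active */);
}
```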

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10642

Reviewed By: cbi42

Differential Revision: D39265381

Pulled By: ajkr

fbshipit-source-id: 9a2f4916ab19726c8fb4b3a3b590b1b9ed93de1b
2022-09-06 12:59:02 -07:00
Andrew Kryczka
fe5fbe32cb Deflake DBBlockCacheTest1.WarmCacheWithBlocksDuringFlush (#10635)
Summary:
Previously, automatic compaction could be triggered prior to the test invoking CompactRange(). It could lead to the following flaky failure:

```
/root/project/db/db_block_cache_test.cc:753: Failure
Expected equality of these values:
  1 + kNumBlocks
    Which is: 11
  options.statistics->getTickerCount(BLOCK_CACHE_INDEX_ADD)
    Which is: 10
```

A sequence leading to this failure was:

* Automatic compaction
  * files [1] [2] trivially moved
  * files [3] [4] [5] [6] trivially moved
* CompactRange()
  * files [7] [8] [9] trivially moved
  * file [10] trivially moved

In such a case, the index/filter block adds that the test expected did not happen since there were no new files.

This PR just tweaks settings to ensure the `CompactRange()` produces one new file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10635

Reviewed By: cbi42

Differential Revision: D39250869

Pulled By: ajkr

fbshipit-source-id: a3c94c49069e28c49c40b4b80dae0059739d19fd
2022-09-04 14:55:09 -07:00
Changyu Bi
30bc495c03 Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and checking it against `RangeDelAggregator`.

With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.

This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.

One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.

Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file were iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
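
A highly simplified sketch of the cascading-seek idea (illustrative types and helpers, not the actual `MergingIterator` code): if level L has a range tombstone `[start, end)` covering the seek target, every older level (> L) can seek directly to `end` instead of `target`.
```
#include <string>
#include <vector>

struct LevelState {
  // Stubbed for illustration; a real implementation would wrap the level's
  // point-key iterator and its range tombstones.
  void SeekPointKeys(const std::string& /*key*/) {}
  bool HasTombstoneCovering(const std::string& /*key*/,
                            std::string* /*end*/) const {
    return false;
  }
};

// Levels are ordered newest to oldest; a tombstone found at a newer level
// lets all older levels skip ahead to its end key. Lexicographic string
// comparison stands in for the user comparator here.
void CascadingSeek(std::vector<LevelState>& levels, const std::string& target) {
  std::string current_target = target;
  for (size_t level = 0; level < levels.size(); ++level) {
    levels[level].SeekPointKeys(current_target);
    std::string tombstone_end;
    if (levels[level].HasTombstoneCovering(current_target, &tombstone_end) &&
        tombstone_end > current_target) {
      current_target = tombstone_end;  // older levels seek past covered keys
    }
  }
}
```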

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449

Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.

```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```

- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50

# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true

# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true

# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Average over 10 runs (some slower tests had fewer runs):

For the first column (tombstone), 0 means no range tombstones, 100-10000 means the width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The single-tombstone case tests for regression when there are very few range tombstones in the DB, since the no-range-tombstone case is likely to take a different code path than the cases with range tombstones.

- Scan entire DB

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
| 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
| 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
| 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
| 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
- Short range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
| 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
| 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
| 10000 |1367                 |28667                |+1997.07%            |
| 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |

- Long range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
| 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
| 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
| 10000 |25                   |414                  |+1556.0%             |
| 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |

- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61

Reviewed By: ajkr

Differential Revision: D38450331

Pulled By: cbi42

fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
2022-09-02 09:51:19 -07:00
Levi Tamasi
b07217da04 Pin the newly cached blob after insert (#10625)
Summary:
With the current code, when a blob isn't found in the cache and gets read
from the blob file and then inserted into the cache, the application gets
passed the self-contained `PinnableSlice` resulting from the blob file read.
The patch changes this so that the `PinnableSlice` pins the cache entry
instead in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D39220904

Pulled By: ltamasi

fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
2022-09-01 16:25:46 -07:00
anand76
5fbcc8c54d Update MULTIGET_IO_BATCH_SIZE for non-async MultiGet (#10622)
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622

Reviewed By: akankshamahajan15

Differential Revision: D39188790

Pulled By: anand1976

fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
2022-08-31 21:03:52 -07:00
Changyu Bi
3a75219e5d Validate option memtable_protection_bytes_per_key (#10621)
Summary:
Sanity check the value of the option `memtable_protection_bytes_per_key` in `ColumnFamilyData::ValidateOptions()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10621

Test Plan: `make check`, added unit test in ColumnFamilyTest.

Reviewed By: ajkr

Differential Revision: D39180133

Pulled By: cbi42

fbshipit-source-id: 009e0da3ccb332d1c9e14d20193304610bd4eb8a
2022-08-31 17:47:07 -07:00
Levi Tamasi
01e88dfeb4 Support using cache warming with the secondary blob cache (#10603)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10603

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39117952

Pulled By: ltamasi

fbshipit-source-id: 5e956fa2fc18974876a5c87686acb50718e0edb7
2022-08-30 17:03:45 -07:00
Hui Xiao
8a85946f58 Add missing mutex when reading from shared variable bg_bottom_compaction_scheduled_, bg_compaction_scheduled_ (#10610)
Summary:
**Context/Summary:**
According to https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_job.h#L328-L332, any read in the form of `*bg_compaction_scheduled_` or `*bg_bottom_compaction_scheduled_` should be protected by the mutex, which isn't the case for some assert statements. This leads to a data race that can be repro'd by the following command:

```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp
mkdir -p $exp

./db_stress --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=1 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --value_size_mult=32 --writepercent=90  --compaction_pri=4 --use_txn=1 --level_compaction_dynamic_level_bytes=True  --compaction_ttl=0  --compact_files_one_in=1000000 --compact_range_one_in=1000000 --value_size_mult=32 --verify_db_one_in=1000  --write_buffer_size=65536 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_key=25000000 --max_key_len=3 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --target_file_size_base=2097152 --target_file_size_multiplier=2
```
```
WARNING: ThreadSanitizer: data race (pid=73424)
  Read of size 4 at 0x7b8c0000151c by thread T13:
    #0 ReleaseSubcompactionResources internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:390 (db_stress+0x630aa3)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::CompactionJob::Run() internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:741 (db_stress+0x630aa3)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3436 (db_stress+0x60b2cc)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2950 (db_stress+0x606d79)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)

  Previous write of size 4 at 0x7b8c0000151c by thread T12 (mutexes: write M438955329917552448):
    #0 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3018 (db_stress+0x6072a1)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)

Location is heap block of size 6720 at 0x7b8c00000000 allocated by main thread:
    #0 operator new(unsigned long, std::align_val_t) <null> (db_stress+0xbab5bb)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) internal_repo_rocksdb/repo/db/db_impl/db_impl_open.cc:1811 (db_stress+0x69769a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**) internal_repo_rocksdb/repo/utilities/transactions/pessimistic_transaction_db.cc:258 (db_stress+0x8ae1f4)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::StressTest::Open(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2611 (db_stress+0x32b927)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::StressTest::InitDb(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:290 (db_stress+0x34712c)
```
This PR adds all the missing mutexes that should have been in place.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10610

Test Plan:
- Past repro command
- Existing CI

Reviewed By: riversand963

Differential Revision: D39143016

Pulled By: hx235

fbshipit-source-id: 51dd4db55ad306f3dbda5d0dd54d6f2513cf70f2
2022-08-30 16:24:01 -07:00
Yanqin Jin
7c0838e65e Use std::make_unique when possible (#10578)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10578

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D39064748

Pulled By: riversand963

fbshipit-source-id: c7c135b7b713608edb14614846050ece6d4cc59d
2022-08-29 19:09:29 -07:00
Hui Xiao
e484b81eee Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573)
Summary:
**Context:**
The crash test below revealed a bug: the directory containing the CURRENT file (short for `dir_contains_current_file` below) did not always get synced after a new CURRENT is created and renamed into place via `RenameFile` as part of the creation.

This bug exposes a risk that such an un-synced directory containing the updated CURRENT can't survive a host crash (e.g., power loss) and hence becomes corrupted. That would then be followed by a recovery from a corrupted CURRENT, which we don't want.

The root cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed down to `SetCurrentFile()`, so in those cases `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` is skipped (which would otherwise internally call `Env/FS::SyncDir()`).
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```

**Summary:**
The PR ensures the non-test path passes down a non-null dir containing CURRENT (which by current RocksDB assumption is just db_dir) by doing the following:
- Renamed `directory_to_fsync` to `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and the CURRENT file
- Changed the `SetCurrentFile()` API to require `dir_contains_current_file` to be passed in, instead of defaulting it to nullptr.
    - Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` and then `VersionSet::ProcessManifestWrites()` (i.e., think of this as a chain of 3 functions related to MANIFEST updates), these 2 functions were also refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in a non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB is the `FSDirectory* db_dir`.
    - The `db_impl` path obtains `DBImpl::directories_.getDbDir()`, while callers with no access to such `directories_` create the directory object on the fly via `FileSystem::NewDirectory(..)` and manage it with a unique pointer to ensure a short lifetime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573

Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement: since we currently don't assert that the dir containing CURRENT is non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still a chance that future developers mistakenly pass down a nullptr dir containing CURRENT, resulting in a skipped directory sync and causing the bug again. Therefore a smarter test (e.g., as quoted from ajkr, "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.

Reviewed By: ajkr

Differential Revision: D39005886

Pulled By: hx235

fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
2022-08-29 17:35:21 -07:00
Levi Tamasi
7818560194 Add a dedicated cache entry role for blobs (#10601)
Summary:
The patch adds a dedicated cache entry role for blob values and switches
to a registered deleter so that blobs show up as a separate bucket
(as opposed to "Misc") in the cache occupancy statistics, e.g.

```
Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601

Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D39107915

Pulled By: ltamasi

fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
2022-08-29 16:11:59 -07:00
Peter Dillinger
c5afbbfe4b Don't wait for indirect flush in read-only DB (#10569)
Summary:
Some APIs for getting live files, which are used by Checkpoint
and BackupEngine, can optionally trigger and wait for a flush. These
would deadlock when used on a read-only DB. Here we fix that by assuming
the user wants the overall operation to succeed and is OK without
flushing (because the DB is read-only).
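
A minimal usage sketch of the affected path, using the public `DB::OpenForReadOnly` and `Checkpoint` APIs (before this fix, the live-files APIs used by Checkpoint could request a flush and hang on a read-only instance):
```
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/utilities/checkpoint.h"

// Create a checkpoint from a read-only DB instance; no flush is possible, so
// the live-files call must proceed without one.
rocksdb::Status CheckpointReadOnlyDb(const std::string& db_path,
                                     const std::string& checkpoint_path) {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::OpenForReadOnly(options, db_path, &db);
  if (!s.ok()) return s;

  rocksdb::Checkpoint* checkpoint = nullptr;
  s = rocksdb::Checkpoint::Create(db, &checkpoint);
  if (s.ok()) {
    s = checkpoint->CreateCheckpoint(checkpoint_path);
  }
  delete checkpoint;
  delete db;
  return s;
}
```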

Follow-up work: the same or other issues can be hit by directly invoking
some DB functions that are clearly not appropriate for read-only
instance, but are not covered by overrides in DBImplReadOnly and
CompactedDBImpl. These should be fixed to avoid similar problems on
accidental misuse. (Long term, it would be nice to have a DBReadOnly
class without those members, like BackupEngineReadOnly.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569

Test Plan: tests updated to catch regression (hang before the fix)

Reviewed By: riversand963

Differential Revision: D38995759

Pulled By: pdillinger

fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
2022-08-29 13:36:23 -07:00
Levi Tamasi
23376aa576 Improve the accounting of memory used by cached blobs (#10583)
Summary:
The patch improves the bookkeeping around the memory usage of
cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
for allocator bin sizes etc., and 2) it also considers the memory usage
of the `BlobContents` object in addition to the blob itself. Note: some unit
tests had been relying on the cache charge being equal to the size of the
cached blob; these were updated.
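
A rough sketch of the accounting idea, using a hypothetical `BlobContentsLike` struct (not the actual `BlobContents` code): the cache charge covers both the allocator's real usable size for the blob buffer and the bookkeeping object that owns it. `malloc_usable_size` is glibc-specific.
```
#include <malloc.h>  // malloc_usable_size (glibc); platform-specific

#include <cstddef>

struct BlobContentsLike {
  char* data = nullptr;  // heap-allocated blob bytes
  size_t size = 0;       // logical blob size
};

// Charge the allocator's usable size (bin rounding etc.) plus the owning
// object's own footprint, rather than just the logical blob size.
size_t ApproximateCacheCharge(const BlobContentsLike& blob) {
  size_t allocated = blob.data ? malloc_usable_size(blob.data) : 0;
  return allocated + sizeof(BlobContentsLike);
}
```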

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39060680

Pulled By: ltamasi

fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
2022-08-26 15:53:08 -07:00
Jay Zhuang
d9e71fb2c5 Fix periodic_task unable to re-register the same task type (#10379)
Summary:
Timer has a limitation that it cannot re-register a task with the same name,
because cancel only marks the task as invalid and waits for the Timer thread
to clean it up later; before the task is cleaned up, the same task name cannot
be added. This makes a task option update likely to fail, since it basically
cancels and re-registers the same task name. Change the periodic task name to a
random unique id and store it in periodic_task_scheduler.

Also refactor `periodic_work` to `periodic_task` so that each job functions
as a `task`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379

Test Plan: unittests

Reviewed By: ajkr

Differential Revision: D38000615

Pulled By: jay-zhuang

fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e
2022-08-25 18:52:37 -07:00
Levi Tamasi
3f57d84af4 Introduce a dedicated class to represent blob values (#10571)
Summary:
The patch introduces a new class called `BlobContents`, which represents
a single uncompressed blob value. We currently use `std::string` for this
purpose; `BlobContents` is somewhat smaller but the primary reason for a
dedicated class is that it enables certain improvements and optimizations
like eliding a copy when inserting a blob into the cache, using custom
allocators, or more control over and better accounting of the memory usage
of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
(We plan to implement these in subsequent PRs.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39000965

Pulled By: ltamasi

fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
2022-08-25 16:45:48 -07:00
Andrew Kryczka
7ad4b38617 Ensure writes to WAL tail during FlushWAL(true /* sync */) will be synced (#10560)
Summary:
WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.

Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.
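
A minimal usage sketch of the public API involved, `DB::FlushWAL(bool sync)`: with `sync=true` the call writes out buffered WAL data and syncs it, and the fix ensures a WAL that becomes inactive while the call is in progress is not prematurely marked as fully synced.
```
#include "rocksdb/db.h"

// Flush buffered WAL writes to the OS and sync them to storage.
rocksdb::Status PersistWal(rocksdb::DB* db) {
  return db->FlushWAL(true /* sync */);
}
```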

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560

Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found

Reviewed By: riversand963

Differential Revision: D38957391

Pulled By: ajkr

fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
2022-08-25 12:53:46 -07:00
muthukrishnan.s
79ed4be80f Add get_name, get_id for column family handle in C API (#10499)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10499

Reviewed By: hx235

Differential Revision: D38523859

Pulled By: ajkr

fbshipit-source-id: 268bba1fcce4a3e20c51e498a79d7b476f663aea
2022-08-24 13:49:02 -07:00
Levi Tamasi
78bbdef530 Fix a typo in BlobSecondaryCacheTest (#10566)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10566

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38989926

Pulled By: ltamasi

fbshipit-source-id: 6402635fe745e4e7eb3083ef9ad9f04c0177d762
2022-08-24 13:08:43 -07:00
EdvardD
6e93d24935 Expose set_checksum function to C api (#10537)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10537

Reviewed By: hx235

Differential Revision: D38797662

Pulled By: ajkr

fbshipit-source-id: a8db723c3eb9d5592cd78f8be7e442e4826686ad
2022-08-23 14:59:27 -07:00
Chen Lixiang
9593fd1c82 Fix wrong compression type and options in universal compaction picker (#10515)
Summary:
In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515

Reviewed By: hx235

Differential Revision: D38611335

Pulled By: ajkr

fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d
2022-08-23 14:58:02 -07:00
anand76
35cdd3e71e MultiGet async IO across multiple levels (#10535)
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
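
A minimal usage sketch: the coroutine-based async path is opted into via `ReadOptions::async_io`, after which keys spread across multiple LSM levels can be read in parallel as described above. This assumes a build with coroutine support and the batched `MultiGet` overload taking raw arrays.
```
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Issue a batched MultiGet with async IO enabled.
void AsyncMultiGet(rocksdb::DB* db, const std::vector<rocksdb::Slice>& keys) {
  rocksdb::ReadOptions read_options;
  read_options.async_io = true;

  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(read_options, db->DefaultColumnFamily(), keys.size(),
               keys.data(), values.data(), statuses.data());
}
```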

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535

Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO

No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`

For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `

Reviewed By: akankshamahajan15

Differential Revision: D38774009

Pulled By: anand1976

fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
2022-08-19 16:52:52 -07:00
Levi Tamasi
81388b36e0 Add support for wide-column point lookups (#10540)
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)

The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.
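
A minimal usage sketch of the new API (exact signatures assumed from the wide-column headers): `GetEntity` fills a `PinnableWideColumns`, whose `columns()` can be iterated directly.
```
#include <iostream>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Look up a wide-column entity and print its columns.
void PrintEntity(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::PinnableWideColumns result;
  rocksdb::Status s = db->GetEntity(rocksdb::ReadOptions(),
                                    db->DefaultColumnFamily(), key, &result);
  if (!s.ok()) return;
  for (const rocksdb::WideColumn& column : result.columns()) {
    std::cout << column.name().ToString() << " = "
              << column.value().ToString() << "\n";
  }
}
```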

There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38847474

Pulled By: ltamasi

fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
2022-08-19 11:51:12 -07:00
Andrew Kryczka
91166012c8 Prevent a case of WriteBufferManager flush thrashing (#6364)
Summary:
Previously, the flushes triggered by `WriteBufferManager` could affect
the same CF repeatedly if it happens to get consecutive writes. Such
flushes are not particularly useful for reducing memory usage since
they switch nearly-empty memtables to immutable while they've just begun
filling their first arena block. In fact they may not even reduce the
mutable memory count if they involve replacing one mutable memtable containing
one arena block with a new mutable memtable containing one arena block.
Further, if such switches happen even a few times before a flush finishes,
the immutable memtable limit will be reached and writes will stall.

This PR adds a heuristic to not switch memtables to immutable for CFs
that already have one or more immutable memtables awaiting flush. There
is a memory usage regression if the user continues writing to the same
CF, that DB does not have any CFs eligible for switching, flushes
are not finishing, and the `WriteBufferManager` was constructed with
`allow_stall=false`. Before, it would grow by switching nearly empty
memtables until writes stall. Now, it would grow by filling memtables
until writes stall. This feels like an acceptable behavior change because
users who prefer to stall over violate the memory limit should be using
`allow_stall=true`, which is unaffected by this PR.
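
A configuration sketch of the `allow_stall=true` setup mentioned above, with constructor parameters assumed from `write_buffer_manager.h`:
```
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/write_buffer_manager.h"

// Prefer stalling writes over exceeding the memory limit; this configuration
// is unaffected by the new memtable-switching heuristic.
rocksdb::Options MakeOptionsWithWriteBufferManager() {
  auto cache = rocksdb::NewLRUCache(1 << 30);
  auto wbm = std::make_shared<rocksdb::WriteBufferManager>(
      64 << 20 /* buffer_size */, cache, true /* allow_stall */);

  rocksdb::Options options;
  options.write_buffer_manager = wbm;
  return options;
}
```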

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364

Test Plan:
- Command:

`rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216`

- `rocksdb.db.write.stall` count before this PR: 175
- `rocksdb.db.write.stall` count after this PR: 0

Reviewed By: jay-zhuang

Differential Revision: D20167197

Pulled By: ajkr

fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7
2022-08-17 15:53:40 -07:00
anand76
65814a4ae6 Fix range deletion handling in async MultiGet (#10534)
Summary:
The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, ```mget_range``` would be non-empty. However, ```mget_range``` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in ```TableCache::MultiGet```.

Test plan -
1. Add a unit test
2. Run stress tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534

Reviewed By: akankshamahajan15

Differential Revision: D38773880

Pulled By: anand1976

fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00
2022-08-17 13:51:39 -07:00
Gang Liao
275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156
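
For illustration, a configuration sketch of the setup motivating the new bottom priority level: one LRU cache backing both the block cache and the blob cache, with reserved pools so index/filter blocks (high priority) and data blocks (low priority) are not evicted by lower-value blobs. Field names (`low_pri_pool_ratio`, `blob_cache`, `enable_blob_files`) are assumed from RocksDB 7.x options headers.
```
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithSharedBlobCache() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1 << 30;
  cache_opts.high_pri_pool_ratio = 0.2;  // index/filter blocks
  cache_opts.low_pri_pool_ratio = 0.6;   // data blocks; remainder for blobs
  auto cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  options.enable_blob_files = true;
  options.blob_cache = cache;  // same backing cache as the block cache
  return options;
}
```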

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
sdong
bc575c614c Fix two extra headers (#10525)
Summary:
Fix copyright for two more extra headers to make internal tool happy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525

Reviewed By: jay-zhuang

Differential Revision: D38661390

fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94
2022-08-12 15:54:35 -07:00
Changyu Bi
fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append a per key-value checksum to the internal key. These checksums are verified on read paths including Get, Iterator, and during Flush. Get and Iterator will return a `Corruption` status if there is a checksum verification failure. Flush will make the DB read-only upon a memtable entry checksum verification failure.
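
A minimal sketch of enabling the feature via the `memtable_protection_bytes_per_key` column family option (per the related validation change, the supported values are 0, 1, 2, 4, and 8):
```
#include "rocksdb/options.h"

// Protect each memtable entry with a 2-byte checksum; larger values give
// stronger detection at a higher memory and CPU cost.
rocksdb::Options MakeOptionsWithMemtableProtection() {
  rocksdb::Options options;
  options.memtable_protection_bytes_per_key = 2;
  return options;
}
```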

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
- Benchmark on memtable read only: ops/sec dropped 31% for `readseq` due to time spent verifying checksums.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
Peter Dillinger
86a1e3e0e7 Derive cache keys from SST unique IDs (#10394)
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.

See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.

This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.

These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):

```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```

After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```

However, the change does modify (probably weaken) the "guaranteed unique" promise from this

> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86

to this (see https://github.com/facebook/rocksdb/issues/10388)

> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits

I don't think this is a practical concern, though.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394

Test Plan: unit tests updated, see simulation results above

Reviewed By: jay-zhuang

Differential Revision: D38667529

Pulled By: pdillinger

fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
2022-08-12 13:49:49 -07:00
Peter Dillinger
9fa5c146d7 LOG more info on oldest snapshot and sequence numbers (#10454)
Summary:
The info LOG file does not currently give any direct
information about the existence of old, live snapshots, nor how to
estimate wall time from a sequence number within the scope of LOG
history. This change addresses both with:
* Logging smallest and largest seqnos for generated SST files, which can
help associate sequence numbers with write time (based on flushes).
* Logging oldest_snapshot_seqno for each compaction, which (along with
that seqno info) helps us to determine how much old data might be kept
around for old (leaked?) snapshots. Including the date here I thought might
be excessive.

I wanted to log the date and seqno of the oldest snapshot with periodic
stats, but the current structure of the code doesn't really support that
because `DumpDBStats` doesn't have access to the DB object.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454

Test Plan:
manual inspect LOG from
`KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`

Reviewed By: ajkr

Differential Revision: D38326948

Pulled By: pdillinger

fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
2022-08-12 13:08:50 -07:00
sdong
2297769b38 Fix regression issue of too large score (#10518)
Summary:
https://github.com/facebook/rocksdb/pull/10057 caused a regression bug: since the base level size is not adjusted based on L0 size anymore, the L0 score might become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.
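
A rough sketch of the adjusted scoring idea (hypothetical helper; the real logic lives in the compaction score calculation and differs in detail):

```
#include <algorithm>
#include <cstdint>

// Sketch: when L0 has grown past its trigger size, score it relative to L1's
// actual size so that L0->L1 compaction cannot indefinitely outrank L1->L2.
double L0ScoreSketch(uint64_t l0_bytes, uint64_t l0_trigger_bytes,
                     uint64_t l1_bytes) {
  double score =
      static_cast<double>(l0_bytes) / static_cast<double>(l0_trigger_bytes);
  if (l0_bytes > l0_trigger_bytes && l1_bytes > 0) {
    score = std::min(
        score, static_cast<double>(l0_bytes) / static_cast<double>(l1_bytes));
  }
  return score;
}
```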

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10518

Test Plan: run db_bench against data on tmpfs and watch the data-stuck-in-L1 behavior go away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
2022-08-11 17:10:36 -07:00
sherriiiliu
4753e5a2e9 Fix wrong value passed in compaction filter in BlobDB (#10391)
Summary:
New BlobDB has a bug in the compaction filter path, where `blob_value_` is not reset for the next iterated key. This leaves `blob_value_` non-empty, so the previous value read from a blob is passed into the filter function for the next key, even if that key's value is not stored in a blob. Fixed by resetting it regardless of key type.
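
The shape of the fix, sketched with hypothetical names (the actual change is in the BlobDB compaction filter plumbing): clear the cached blob value before handling each key, not only for blob-index keys.

```
#include <string>

// Hypothetical holder for the last blob value read during compaction.
struct BlobValueHolder {
  std::string value;
  bool loaded = false;
  void Reset() {
    value.clear();
    loaded = false;
  }
};

// Sketch of the per-key step: resetting unconditionally guarantees a plain
// (non-blob) key can never see the previous key's blob value.
void OnNextKey(BlobValueHolder& blob_value, bool key_is_blob_index) {
  blob_value.Reset();  // the fix: reset regardless of key type
  if (key_is_blob_index) {
    // ... fetch the blob and populate blob_value here ...
  }
}
```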

Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391

Reviewed By: riversand963

Differential Revision: D38629900

Pulled By: ltamasi

fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
2022-08-11 13:55:28 -07:00
sdong
9277569ba3 Add some missing headers (#10519)
Summary:
Some files are missing headers, and some headers are irregular. Fix them to make an internal checkup tool happy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519

Reviewed By: jay-zhuang

Differential Revision: D38603291

fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
2022-08-11 12:45:50 -07:00
Jay Zhuang
5d3aefb682 Migrate to docker for CI run (#10496)
Summary:
Moved Linux builds to Docker to avoid the CI instability caused by a dependency installation site going down.
Added the `Dockerfile` which is used to build the image.
Build time is also significantly reduced, because no dependency installation is needed and 2xlarge+ instances are used for slow builds (like the TSAN test).
Also fixed a few issues detected while building this:
* `DestroyDB()` Status not checked in a few tests
* nullptr might be used in `inlineskiplist.cc`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D38554200

Pulled By: jay-zhuang

fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
2022-08-10 17:34:38 -07:00
Guido Tagliavini Ponce
a0798f6f92 Enable ClockCache in DB block cache test (#10482)
Summary:
A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482

Test Plan: ``make -j24 check``

Reviewed By: anand1976

Differential Revision: D38553073

Pulled By: guidotag

fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae
2022-08-10 13:57:52 -07:00
sdong
911c0208b9 WritableFileWriter tries to skip operations after failure (#10489)
Summary:
A flag in WritableFileWriter is introduced to remember that an error has happened. Subsequent operations will fail with an assertion. Those operations, except Close(), are not supposed to be called anyway. This change will help catch bugs in tests and stress tests and limit the damage of a potential bug where writing to a file continues after a failure.
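
A minimal sketch of the error-latching pattern described above (hypothetical class, not the actual WritableFileWriter code):

```
#include <cassert>
#include <cstddef>

class LatchingWriter {
 public:
  bool Append(const char* data, size_t size) {
    assert(!seen_error_);  // in tests, calling after a failure is a bug
    if (!WriteToFile(data, size)) {
      seen_error_ = true;  // latch: all later operations will assert
      return false;
    }
    return true;
  }

  // Close() remains callable after an error so resources can be released.
  void Close() { closed_ = true; }

 private:
  bool WriteToFile(const char* /*data*/, size_t /*size*/) { return true; }
  bool seen_error_ = false;
  bool closed_ = false;
};
```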

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489

Test Plan: Fix existing unit tests and watch crash tests for a while.

Reviewed By: anand1976

Differential Revision: D38473277

fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
2022-08-10 10:19:20 -07:00
Yanqin Jin
fee2c472d0 Include minimal contextual information in CompactionIterator (#10505)
Summary:
The main purpose is to make debugging easier without sacrificing performance.

Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to a `uint8_t`,
using the LSB to denote whether the compaction iterator is valid and 4 additional bits to denote where
the iterator was set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits the assertion there, we have a better idea of what has gone wrong.
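
A sketch of the packing (bit layout is illustrative; the actual constants live in the CompactionIterator implementation):

```
#include <cstdint>

// LSB = "iterator is valid"; the next 4 bits record which code path inside
// NextFromInput() marked it valid, to aid debugging at PrepareOutput().
constexpr uint8_t kValidBit = 0x1;

inline uint8_t MarkValid(uint8_t call_site_id) {  // call_site_id in [0, 15]
  return static_cast<uint8_t>(kValidBit | (call_site_id << 1));
}

inline bool IsValid(uint8_t v) { return (v & kValidBit) != 0; }
inline uint8_t ValidContext(uint8_t v) { return (v >> 1) & 0xF; }
```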

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505

Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command includes a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression, drop in throughput, or latency increase.

Reviewed By: ltamasi

Differential Revision: D38551615

Pulled By: riversand963

fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
2022-08-09 17:07:24 -07:00
anand76
0b02960d8c Fix MultiGet range deletion handling and a memory leak (#10513)
Summary:
This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 -
1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level.
2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513

Test Plan: Add a new unit test that fails without this fix

Reviewed By: akankshamahajan15

Differential Revision: D38548739

Pulled By: anand1976

fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c
2022-08-09 14:44:47 -07:00
Levi Tamasi
06b04127a8 Reset blob value as soon as it's not needed in DBIter (#10490)
Summary:
We have recently added caching support to BlobDB, and separately,
implemented an optimization where reading blobs from the cache
results in the cache handle being transferred to the target `PinnableSlice`
(as opposed to the contents getting copied). With these changes,
it makes sense to reset the `PinnableSlice` storing the blob value in
`DBIter` as soon as we move to a different iterator position to prevent
us from holding on to the cache handle any longer than necessary.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38473630

Pulled By: ltamasi

fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
2022-08-09 11:39:57 -07:00
Levi Tamasi
24bcab7d5d Make queries return the value of the default column for wide-column entities (#10483)
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when a wide-column entity
is encountered during a query, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38441881

Pulled By: ltamasi

fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
2022-08-08 16:10:08 -07:00
Jay Zhuang
3f763763aa Change bottommost_temperture to last_level_temperture (#10471)
Summary:
Change the tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purposes only;
it behaves the same as `last_level_temperture` and will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Jay Zhuang
375534752a Improve universal compaction picker for tiered compaction (#10467)
Summary:
The current universal compaction picker may cause extra size amplification
compaction if there is more hot data on the penultimate level. Improve the picker
to skip the last level in the size amp calculation when tiered compaction is
enabled (a rough sketch follows the list), which can
1. avoid extra, unnecessary size amp compaction;
2. intentionally skip the cold tier (the last level), since it is typically
   not size constrained.
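
A rough sketch of the adjustment (hypothetical helper; the real picker works over sorted runs and differs in detail):

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Size amplification estimate: bytes above the base divided by the base.
// With tiered compaction the base becomes the penultimate level, so the
// unconstrained cold tier (last level) no longer drives size amp compactions.
double SizeAmpRatioSketch(const std::vector<uint64_t>& level_bytes,
                          bool skip_last_level) {
  if (level_bytes.size() < 2) {
    return 0.0;
  }
  size_t base_idx =
      skip_last_level ? level_bytes.size() - 2 : level_bytes.size() - 1;
  if (level_bytes[base_idx] == 0) {
    return 0.0;
  }
  uint64_t above = 0;
  for (size_t i = 0; i < base_idx; ++i) {
    above += level_bytes[i];
  }
  return static_cast<double>(above) /
         static_cast<double>(level_bytes[base_idx]);
}
```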

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467

Test Plan: CI and added unittest

Reviewed By: siying

Differential Revision: D38391350

Pulled By: jay-zhuang

fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
2022-08-08 14:34:36 -07:00
Levi Tamasi
0cc9e98bbb Respect fill_cache when reading blobs in DBIter (#10492)
Summary:
Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have
to explicitly set the `fill_cache` read option when reading blobs in
`DBIter` to prevent the cache from getting polluted by queries with
`fill_cache` set to false. (Before we added support for a blob cache,
the setting had not made any difference either way.)
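
For context, the user-facing knob involved is `ReadOptions::fill_cache`; a typical usage sketch for a large scan (assuming an already-open `db`):

```
#include <memory>

#include "rocksdb/db.h"

void ScanWithoutCachePollution(rocksdb::DB* db) {
  rocksdb::ReadOptions read_opts;
  read_opts.fill_cache = false;  // don't let the scan evict hot blocks/blobs
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // process it->key() / it->value()
  }
}
```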

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38476121

Pulled By: ltamasi

fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80
2022-08-08 08:26:33 -07:00
Burton Li
e446bc65e6 Remove local static string (#8103)
Summary:
A local static string is not friendly to a jemalloc arena-aware implementation, as it will be allocated on the arena of the first caller, which causes a crash if that arena gets refunded earlier.

P.S. In a jemalloc arena-aware implementation, each RocksDB instance only uses certain jemalloc arenas, and an arena is refunded after the associated DB instance is destroyed.
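
A sketch of the hazard and one possible shape of the fix (illustrative names only; the actual call site is in the RocksDB source):

```
#include <string>

// Hazard: the static std::string below is heap-allocated on the jemalloc
// arena of whichever DB instance happens to call this first; if that
// instance's arenas are refunded, later callers touch freed memory.
//
//   const std::string& UnknownLabel() {
//     static const std::string kUnknown = "unknown";
//     return kUnknown;
//   }

// Safer: keep the data entirely in static storage, with no heap allocation.
inline const char* UnknownLabel() { return "unknown"; }
```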

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103

Reviewed By: ajkr

Differential Revision: D38477235

Pulled By: ltamasi

fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9
2022-08-05 23:03:51 -07:00