9532 Commits

Author SHA1 Message Date
Ata E Husain Bohra
e1ca0ef9a2
Defer recoveredDiskFiles wait if Encryption data at-rest is enabled (#7414)
* Defer recoveredDiskFiles wait if Encryption data at-rest is enabled

Description

In the current code, ClusterController startup waits for the 'recoveredDiskFiles'
future to complete before triggering the 'clusterControllerCore' actor, which
in turn starts the 'EncryptKeyProxy' (EKP) actor responsible for fetching/refreshing
the encryption keys needed for ClusterRecovery as well as for interactions with
KMS.

The patch addresses a circular dependency: StorageServer initialization
depends on EKP, but CC doesn't recruit EKP until 'recoveredDiskFiles' completes,
which includes SS initialization. Given that 'recoveredDiskFiles' is an optimization,
the patch proposes deferring the wait for 'recoveredDiskFiles' future completion until
new Master recruitment is done as part of ClusterRecovery (unblocking the EKP singleton).
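
The ordering problem described above can be illustrated with a small asyncio model. This is only a sketch under simplifying assumptions, not the FoundationDB code: the function names, the `ekp_ready` event, and the timeout-based deadlock detection are all hypothetical, and the real patch defers the wait until Master recruitment rather than simply reordering two awaits.

```python
import asyncio


async def simulate(defer_recovered_wait: bool, timeout: float = 0.2) -> list[str]:
    """Return the startup events observed before completion or timeout."""
    events: list[str] = []
    ekp_ready = asyncio.Event()  # hypothetical stand-in for "EKP is recruited"

    async def storage_server_init() -> None:
        # Storage server initialization needs encryption keys from EKP.
        await ekp_ready.wait()
        events.append("ss_initialized")

    async def cluster_controller_core() -> None:
        # The core actor is what starts the EKP singleton.
        events.append("ekp_started")
        ekp_ready.set()

    async def cluster_controller() -> None:
        # Models the 'recoveredDiskFiles' future, which includes SS init.
        recovered_disk_files = asyncio.create_task(storage_server_init())
        if not defer_recovered_wait:
            # Old ordering: block on recovery before starting the core actor.
            await recovered_disk_files
            await cluster_controller_core()
        else:
            # Deferred ordering: start the core first, then wait for recovery.
            await cluster_controller_core()
            await recovered_disk_files
        events.append("startup_done")

    try:
        await asyncio.wait_for(cluster_controller(), timeout)
    except asyncio.TimeoutError:
        # Neither side can make progress: the circular dependency.
        events.append("deadlock")
    return events


if __name__ == "__main__":
    print(asyncio.run(simulate(defer_recovered_wait=False)))
    print(asyncio.run(simulate(defer_recovered_wait=True)))
```

With the old ordering the model never gets past the recovery wait; with the deferred ordering EKP starts first and storage server initialization can finish.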

Testing

Ran 500K correctness runs: 20220618-055310-ahusain-foundationdb-61c431d467557551
Recorded failures don't seem to be related to the change.
2022-06-21 18:18:57 -07:00
Bharadwaj V.R
8cf2be030f
Build a TenantCache for use by DD (#7207)
* Add a DD tenant-cache-assembly actor
* Add basic tenant list monitoring for tenant cache. 
* Update DD tenant cache refresh to be more efficient and unit-testable
* Remove the DD prefix in the tenant cache class name (and associated impl and UT class names); there is nothing specific to DD in it; DD uses it; other modules may use it in the future
* Disable DD tenant awareness by default
2022-06-21 16:29:30 -07:00
Lukas Joswiak
9ca8a3c683 Reenable status json for dynamic knobs, add unit test 2022-06-21 11:43:05 -07:00
Dan Lambright
c48d569024
fix a fault injection bug in txn store recovery (#7405)
* fix a fault injection bug in txn store recovery

* Update LogSystemDiskQueueAdapter.actor.cpp

typo

* recoverLoc can be overwritten, so on reset use the stored range start
2022-06-21 12:33:58 -04:00
Josh Slocum
34e6a8f942
Merge pull request #7399 from sfc-gh-jslocum/bg_tenant_improvements
Bg tenant improvements
2022-06-17 11:19:41 -05:00
Markus Pilman
5aacaf891c
Merge pull request #7321 from sfc-gh-ajbeamon/multiple-tenant-creation
Support creating multiple tenants in the same transaction
2022-06-17 10:10:09 -06:00
Xiaoxi Wang
6bb4e341f9
Merge pull request #7110 from sfc-gh-xwang/features/ppw-pause-state
Adding paused/running wiggling status to status json and also the last running/paused timestamp
2022-06-16 14:27:18 -07:00
Xiaoxi Wang
a311cc28cc solve some comments 2022-06-16 11:07:21 -07:00
Josh Slocum
b3597ef3a8 Added plumbing for tenant-aware purge granules 2022-06-16 13:04:34 -05:00
Andrew Noyes
83aceb216c
Use absl::GetStackTrace for slow task profiler (#7374)
* Make SlowTask workload runnable in joshua

* Remove SignalSafeUnwind, and use absl::GetStackTrace for slow task profiler
2022-06-15 14:53:52 -07:00
Sreenath Bodagala
2c85bb71c1 - Do not try to figure out the sequencer locality if knob
ENABLE_VERSION_VECTOR_HA_OPTIMIZATION is disabled.
2022-06-15 16:08:31 +00:00
Ata E Husain Bohra
8808d93813
Fix bugs in EncryptKeyProxy actor (#7388)
Description

Major changes include:
1. GetEncryptByKeyIds cache elements can expire.
2. Update iterator after erasing an element during refresh encryption keys
   operation.
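
The second item concerns removing cache entries while walking the cache. The commit itself is C++, where the usual concern is an iterator invalidated by erase. The sketch below is a Python analogue of the same hazard, not the actual fix: the `refresh_cache` name and the "key id to expiry time" layout are assumptions made for illustration.

```python
def refresh_cache(cache: dict[int, float], now: float) -> list[int]:
    """Drop entries whose expiry time has passed; return the removed ids."""
    removed = []
    # Iterate over a snapshot of the keys: deleting from a dict while
    # iterating over it directly raises RuntimeError in Python.
    for key_id in list(cache):
        if cache[key_id] <= now:
            del cache[key_id]
            removed.append(key_id)
    return removed


if __name__ == "__main__":
    cache = {1: 5.0, 2: 15.0, 3: 10.0}
    print(refresh_cache(cache, now=10.0), cache)
```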

Testing

EncryptKeyProxyTest
2022-06-14 21:22:25 -07:00
Yao Xiao
7da26db342
[ShardedRocksDB] 4/N Support removeRange. (#7345) 2022-06-14 13:52:03 -07:00
Yi Wu
6246664006
Support encrypting TxnStateStore (#7253)
Adding encryption support for TxnStateStore. It is done by supporting encryption for KeyValueStoreMemory. The encryption is currently done at the operation level, when the operations are being written to the underlying log file. See inline comment for the encrypted data format.

This PR depends on #7252. It is part of the effort to support TLog encryption #6942.
2022-06-14 13:26:32 -07:00
Trevor Clinkenbeard
6bed046148
Merge pull request #7352 from sfc-gh-xwang/feature/ddtxn
[DD testability enhancement] Create IDDTxnProcessor and simple refactoring
2022-06-13 16:01:13 -07:00
Xiaoxi Wang
ef0f415e3d add option; change to shared_ptr 2022-06-13 13:55:48 -07:00
Andrew Noyes
013b290ca5
Don't fail test if log cursor times out during network partition (#7330)
* Don't fail test if log cursor times out during network partition

Also, exercise the codepath for handling timed_out in simulation, by
reverting this knob buggification behavior to that of 07976993e7.

* clang-format
2022-06-13 13:28:22 -07:00
Trevor Clinkenbeard
942d687506
Clean up includes in actor header files (#7331)
* Remove unnecessary actorcompiler.h includes (from non-actor files)

* Make AsyncFileChaos a non-actor header file

* Add unactorcompiler.h include to the end of actor header files

* Add missing actorcompiler.h includes to actor header files
2022-06-13 13:26:51 -07:00
Ata E Husain Bohra
a5d91fe18a
KmsConnector implementation to support KMS driven CipherKey TTL (#7334)
* KmsConnector implementation to support KMS driven CipherKey TTL

Description

KMS CipherKeys can be of two types:
1. Revocable CipherKeys: having a finite lifetime, after which the CipherKey
shouldn't be used by the FDB.
2. Non-revocable CipherKeys: ciphers are not revocable, however, FDB would
still want to refresh ciphers to support KMS cipher rotation feature.

The patch proposes the following changes to incorporate support for the above-defined
cipher-key types:
1. Extend KmsConnector response to include optional 'refreshAfter' & 'expireAfter'
time intervals. EncryptKeyProxy (EKP) cache would define corresponding absolute refresh &
expiry timestamps for a given cipherKey. In the event of a transient KMS connectivity outage,
a caller of EKP API for a non-revocable key should continue using the cached cipherKey until
it expires.
2. Simplify KmsConnector API arena handling by using VectorRef to represent component
structs and manage associated memory allocation/lifetime.

Testing

1. EncryptKeyProxyTest
2. RESTKmsConnectorTest
3. SimKmsConnectorTest
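
As a rough illustration of how optional 'refreshAfter' and 'expireAfter' intervals could be turned into absolute timestamps in a cache entry, here is a minimal sketch. The class, the function names, and the choice of `None` to mean "no deadline" are assumptions for this example; the actual EKP cache structures are not shown in this commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedCipherKey:
    key: bytes
    refresh_at: Optional[float]  # absolute time; None means no refresh deadline (assumption)
    expire_at: Optional[float]   # absolute time; None means the key never expires (assumption)


def make_entry(key: bytes, now: float,
               refresh_after: Optional[float] = None,
               expire_after: Optional[float] = None) -> CachedCipherKey:
    """Convert relative intervals from the KMS response into absolute timestamps."""
    return CachedCipherKey(
        key=key,
        refresh_at=None if refresh_after is None else now + refresh_after,
        expire_at=None if expire_after is None else now + expire_after,
    )


def needs_refresh(entry: CachedCipherKey, now: float) -> bool:
    return entry.refresh_at is not None and now >= entry.refresh_at


def is_usable(entry: CachedCipherKey, now: float) -> bool:
    # A key past its refresh time stays usable until it expires, which is
    # what lets callers ride out a transient KMS outage.
    return entry.expire_at is None or now < entry.expire_at


if __name__ == "__main__":
    entry = make_entry(b"k", now=100.0, refresh_after=10.0, expire_after=30.0)
    print(needs_refresh(entry, 115.0), is_usable(entry, 115.0))
```

Keeping refresh and expiry as separate timestamps means a failed refresh does not by itself make a cached key unusable.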

* KmsConnector implementation to support KMS driven CipherKey TTL

Description

  diff-3: Address review comment
  diff-2: Fix Valgrind issues discovered running tests
  diff-1: Set expireTS for baseCipherId indexed cache
2022-06-13 13:25:01 -07:00
Xiaoxi Wang
1de6c09307 use struct instead of tuple 2022-06-13 11:27:50 -07:00
Xiaoxi Wang
c12a7a30ed
Update fdbserver/DataDistributionQueue.actor.cpp
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2022-06-13 08:22:48 -07:00
Xiaoxi Wang
9604db3f10
Update fdbserver/DDTxnProcessor.h
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2022-06-13 08:19:14 -07:00
Xiaoxi Wang
fb66561bc4 format code 2022-06-09 14:43:09 -07:00
Xiaoxi Wang
7ee6808ebd solve compiler warning 2022-06-09 14:32:24 -07:00
Xiaoxi Wang
b99bd45730 format code 2022-06-09 12:36:20 -07:00
Xiaoxi Wang
e5aa5fef22 merge upstream/main 2022-06-09 12:17:27 -07:00
Xiaoxi Wang
6ab12ea971 add storeTuple and unit test; refactor getSourceServersForRange 2022-06-09 12:16:12 -07:00
Yao Xiao
0bb02f6415
[Sharded RocksDB] 3/N Implement functions for range clear. (#7310) 2022-06-09 10:50:39 -07:00
Jingyu Zhou
7acd184a38
Merge pull request #7339 from jzhou77/fix-status-memory
Add rss_bytes to process memory and fix available_bytes calculation
2022-06-08 13:10:51 -07:00
Jingyu Zhou
b9ff6bc129 Address AJ's comments 2022-06-08 09:38:32 -07:00
Sreenath Bodagala
fe5f11358f
Merge pull request #7318 from sbodagala/main
Introduce a knob that controls the placement of remote storage server commit versions in version vector
2022-06-08 12:18:15 -04:00
Markus Pilman
d141347500
Merge pull request #7282 from Doxense/fix-windows-tests
Fix windows tests
2022-06-08 08:18:47 -06:00
Bharadwaj V.R
d4b983264b
Merge branch 'apple:main' into ddneat 2022-06-07 23:10:56 -07:00
Bharadwaj V.R
b40553556b
Merge pull request #7281 from sfc-gh-bvr/mcvf-nothrottle
Remove last-limited check from DDMountainChopper and DDValleyFiller
2022-06-07 21:15:47 -07:00
Yi Wu
bbf8cb4b02
GetEncryptCipherKeys helper function and misc encryption changes (#7252)
Adding GetEncryptCipherKeys and GetLatestCipherKeys helper actors, which encapsulate cipher key fetch logic: getting cipher keys from the local BlobCipherKeyCache and, on a cache miss, fetching them from EKP (encrypt key proxy). These helper actors also handle the case where EKP gets shut down in the middle of a request: they listen on ServerDBInfo, wait for the new EKP to start, and send a new request there instead.
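
The cache-then-fetch flow described above can be sketched as follows. This is a simplified model, not the actual helper actors: `get_cipher_keys`, `local_cache`, and `fetch_from_ekp` are hypothetical names, and the retry against a newly started EKP is deliberately left out.

```python
from typing import Callable, Dict, List


def get_cipher_keys(
    key_ids: List[int],
    local_cache: Dict[int, bytes],
    fetch_from_ekp: Callable[[List[int]], Dict[int, bytes]],
) -> Dict[int, bytes]:
    """Serve keys from the local cache; fetch only the misses from EKP."""
    found = {k: local_cache[k] for k in key_ids if k in local_cache}
    missing = [k for k in key_ids if k not in local_cache]
    if missing:
        # One batched request for all cache misses (an assumption of this sketch).
        fetched = fetch_from_ekp(missing)
        local_cache.update(fetched)  # populate the cache for later callers
        found.update(fetched)
    return found


if __name__ == "__main__":
    cache = {1: b"a"}
    print(get_cipher_keys([1, 2], cache, lambda ids: {i: b"fetched" for i in ids}))
```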

The PR also has other misc changes:
* EKP is by default started in simulation regardless of the ENABLE_ENCRYPTION knob, so that in restart tests, if ENABLE_ENCRYPTION is switched from on to off after restart, encrypted data will still be able to be read.
* API tweaks for BlobCipher
* Adding an ENABLE_TLOG_ENCRYPTION knob which will be used in later PRs. The knob should normally be consistent with the ENABLE_ENCRYPTION knob, but could be used to disable TLog encryption alone.

This PR is split out from #6942.
2022-06-07 21:00:13 -07:00
Jingyu Zhou
217ba24b6f Add rss_bytes to process memory and fix available_bytes calculation
Since memory is now limited with RSS size, add RSS size in status json for
reporting. Also change how available_bytes is calculated from:
  (available + virtual memory) * process_limit / machine_limit
to:
  (available memory) * process_limit / machine_limit
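
For a concrete feel of the difference between the two formulas, here is a small worked sketch. The function and parameter names are illustrative only, and the inputs are arbitrary example values in the same unit rather than numbers taken from a real status json.

```python
def available_bytes_old(available: float, virtual: float,
                        process_limit: float, machine_limit: float) -> float:
    # Previous formula: virtual memory was counted as available.
    return (available + virtual) * process_limit / machine_limit


def available_bytes_new(available: float,
                        process_limit: float, machine_limit: float) -> float:
    # New formula: only available memory is scaled by the process share.
    return available * process_limit / machine_limit


if __name__ == "__main__":
    # Example values: 8 units available, 4 units virtual, and a process
    # limit of 8 on a machine limited to 32.
    print(available_bytes_old(8, 4, 8, 32), available_bytes_new(8, 8, 32))
```

With these inputs the old formula gives a larger figure than the new one, because virtual memory was being counted as available.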
2022-06-07 16:44:14 -07:00
Andrew Noyes
1997e6057c
Fix a heap-use-after-free in a unit test (#7230)
* Fix a heap-use-after-free in a unit test

The data passed to IAsyncFile::write must remain valid until the future
is ready.

* Use holdWhile instead of a new state variable
2022-06-07 14:48:01 -07:00
Josh Slocum
a0bb585260
Merge pull request #7333 from sfc-gh-jslocum/blob_metadata_valgrind_fix
fixes for blob metadata memory from valgrind
2022-06-07 15:24:11 -05:00
Andrew Noyes
1f8fc32f41
Save a memcpy in the tlog peek path (#7328) 2022-06-07 13:22:56 -07:00
Xiaoxi Wang
21e7e6d2ba add DDTxnProcessor (incomplete) 2022-06-07 11:58:16 -07:00
Josh Slocum
ae865027d6 fixes for blob metadata memory from valgrind 2022-06-07 13:50:11 -05:00
Xiaoxi Wang
541f98e111 create DDTxnProcessor 2022-06-07 11:48:59 -07:00
Sreenath Bodagala
96a88e3847 Merge remote-tracking branch 'apple-upstream/main' 2022-06-07 18:38:35 +00:00
A.J. Beamon
4f308b34fc Fix an off-by-one error in determining whether to include the entire range in the conflict ranges when a reverse range read returns early due to limit. 2022-06-07 08:52:10 -07:00
Yao Xiao
5f1a061e3a
Disable rocksdb metrics. (#7327) 2022-06-06 14:27:41 -07:00
Bharadwaj V.R
aa84f8925e
Merge branch 'apple:main' into mcvf-nothrottle 2022-06-06 13:18:11 -07:00
Dan Adkins
bd47f390bd
Add simulation test for three_data_hall configuration (#7305)
* Add simulation test for 1 data hall + 1 machine failure case.

* Disable BUGGIFY for DEGRADED_RESET_INTERVAL.

A simulation test discovered a situation where machines attempting to connect
to a dead coordinator (with a well-known endpoint) were getting themselves
marked degraded. This flapping of the degraded state prevented recovery from
completing, as it started over any time it noticed that tlogs on degraded
hosts could be relocated to non-degraded ones.

bin/fdbserver -r simulation -f tests/rare/CycleWithDeadHall.toml -b on -s 276841956
2022-06-06 13:14:49 -07:00
Bharadwaj V.R
990c789a5c
Increase quiet-database timeout when buggify is on; data-movements in simulation take longer than the timeout allows, and waiting for quiet-database does succeed when given some more time (#7290) 2022-06-06 13:13:11 -07:00
Josh Slocum
a3289f9cab adding tenant prefix to bg ranges call 2022-06-06 14:09:10 -05:00
Bharadwaj V.R
7f079a6c29
Merge branch 'apple:main' into mcvf-nothrottle 2022-06-06 12:03:13 -07:00