* Log failed connection attempts in monitorProxies
* Update coordinator list from the cluster file after failing to connect to all coordinators
* Wiggle and upgrade test with legacy version monitoring; updating tests to use 7.1.9
* Update coordinator list from the cluster file: addressing review comments
* Wait on future for all setAndPersistConnectionString calls
* Defer recoveredDiskFiles wait if Encryption data at-rest is enabled
Description
In the current code, ClusterController startup waits for the 'recoveredDiskFiles'
future to complete before triggering the 'clusterControllerCore' actor, which
in turn starts the 'EncryptKeyProxy' (EKP) actor responsible for fetching/refreshing
encryption keys needed for ClusterRecovery as well as for interactions with
KMS.
The patch addresses a circular dependency where StorageServer initialization
depends on EKP, but CC doesn't recruit EKP until 'recoveredDiskFiles' completes,
which includes SS initialization. Given that 'recoveredDiskFiles' is an optimization,
the patch proposes deferring the wait on the 'recoveredDiskFiles' future until
new Master recruitment is done as part of ClusterRecovery (unblocking the EKP singleton).
Testing
Ran 500K correctness runs: 20220618-055310-ahusain-foundationdb-61c431d467557551
Recorded failures don't seem to be related to the change.
* Add a DD tenant-cache-assembly actor
* Add basic tenant list monitoring for tenant cache.
* Update DD tenant cache refresh to be more efficient and unit-testable
* Remove the DD prefix in the tenant cache class name (and associated impl and UT class names); there is nothing specific to DD in it; DD uses it; other modules may use it in the future
* Disable DD tenant awareness by default
* fix a fault injection bug in txn store recovery
* Update LogSystemDiskQueueAdapter.actor.cpp
typo
* recoverLoc can be overwritten, so on reset use the stored range start
Currently, a std::string is copied unnecessarily for every key and value
in a trace event.
This actually showed up in a jemalloc heap profile while I was
investigating something unrelated. I was surprised to see it since these
allocations should have a very short lifetime.
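A minimal sketch of the general idea, not the actual trace event code: `FieldList` and `addField` are hypothetical names introduced for illustration. Taking the strings by value and moving them into storage avoids an extra std::string copy when the caller passes a temporary or a moved-from string.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a trace event's field storage.
class FieldList {
public:
    // Take both strings by value, then move them into the vector.
    // Callers passing temporaries (or std::move'd strings) pay no copy.
    void addField(std::string key, std::string value) {
        fields_.emplace_back(std::move(key), std::move(value));
    }

    std::size_t size() const { return fields_.size(); }

    const std::pair<std::string, std::string>& at(std::size_t i) const {
        return fields_.at(i);
    }

private:
    std::vector<std::pair<std::string, std::string>> fields_;
};
```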
* Generate GNU compatible build-id for mockkms golang binary
Description
diff-1: Fix compilation issue
Generate GNU compatible build-id for mockkms golang binary
Leverage "cgo" to generate build-id
Testing
Debian package build, verified the GNU build-id
Description
Major changes include:
1. GetEncryptByKeyIds cache elements can expire.
2. Update iterator after erasing an element during refresh encryption keys
operation.
Testing
EncryptKeyProxyTest
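The two changes above (expiring cache elements, and updating the iterator after an erase) can be sketched together. This is an illustration under assumptions, not the EKP implementation: `CachedKey`, `KeyCache`, and `evictExpired` are hypothetical names, and time is a plain integer.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical cache entry: a key blob plus an absolute expiry time.
struct CachedKey {
    std::string cipher;
    int64_t expiresAt; // same clock and unit as `now` below
};

using KeyCache = std::unordered_map<uint64_t, CachedKey>;

// Drop every entry whose expiry has passed; return how many were removed.
// erase() invalidates the erased iterator, so continue from its return
// value instead of incrementing the stale one.
inline int evictExpired(KeyCache& cache, int64_t now) {
    int removed = 0;
    for (auto it = cache.begin(); it != cache.end();) {
        if (it->second.expiresAt <= now) {
            it = cache.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```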
Adding encryption support for TxnStateStore. It is done by supporting encryption for KeyValueStoreMemory. The encryption is currently done at the operation level, when the operations are being written to the underlying log file. See the inline comment for the encrypted data format.
This PR depends on #7252. It is part of the effort to support TLog encryption #6942.
This patch fixes the following compile error:
/root/src/fdbclient/S3BlobStore.actor.cpp:410:9: error: moving a local
object in a return statement prevents copy elision
[-Werror,-Wpessimizing-move]
return std::move(resource);
^
/root/src/fdbclient/S3BlobStore.actor.cpp:410:9: note: remove std::move
call here
return std::move(resource);
^~~~~~~~~~ ~
1 error generated.
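For reference, a small sketch of the pattern the warning is about. `makeResource` and the string literal are made up for illustration; this is not the S3BlobStore code.

```cpp
#include <string>

// Flagged by -Wpessimizing-move: std::move on a local in a return statement
// blocks copy elision (NRVO).
//
//   std::string makeResourceSlow() {
//       std::string resource = "/bucket/object";
//       return std::move(resource);
//   }

// Preferred: return the local by name. The compiler may elide the copy,
// and otherwise treats `resource` as an rvalue and moves it.
std::string makeResource() {
    std::string resource = "/bucket/object";
    return resource;
}
```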
This fixes an issue that occurs when recovery and a coordinator key change happen
together. The issue occurs when:
1. Recovery starts
2. A coordinator key change transaction starts
3. During the recovery, the coordinator key is read from the cluster file and
stored in the storage server
4. The cluster controller receives `ChangeCoordinatorsRequest` and
updates the cluster name with the new value.
At this stage, the values of the coordinator key in the storage server and
in the worker are inconsistent.
5. changeQuorumChecker is called, which verifies this consistency.
Since the values differ, the call returns failure and the
caller, which could be a TEST_CASE, fails.
This is a rare race. Once the recovery/coordinator key change
process is done, the database is in a proper state that allows
changeQuorumChecker to behave properly. In this
case, a retry mechanism should be sufficient to fix the corresponding test
failures.
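The retry idea can be sketched generically as below. `retryUntilConsistent` is a hypothetical helper, not the code in this patch; real code would wait between attempts and call the actual checker.

```cpp
#include <functional>

// Hypothetical helper: call `check` until it succeeds or attempts run out.
// Returns true on success, false if every attempt failed.
inline bool retryUntilConsistent(const std::function<bool()>& check, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (check()) {
            return true;
        }
    }
    return false;
}
```

A transient inconsistency that clears after a few attempts then no longer fails the caller, while a persistent one still surfaces once `maxAttempts` is exhausted.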
* Work around flow trace's data race bug
BaseTraceEvent::setNetworkThread() and flushTraceFile[()|Void()]
have a long-standing race condition on the traceEventThrottlerCache global
when flushTraceFileVoid() is not called from the network thread.
This race dates back to 2017 (commit hash 80e5fecfe2),
so until the race itself is fixed, work around the problem.
* Remove call to flushTraceFileVoid() from MkCertCli
* Apply clang format
* Close trace file when error happens in runNetwork().
* Improve the bestCount algorithm in getLeader().
In the current implementation, if the nominees are [0,1], the chosen leader will be 1. This is inconsistent with the other cases and with our expectation that, if 2 nominees have the same frequency, the one with the lower id will be the leader.
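A sketch of the expected tie-breaking behaviour, assuming nominees are plain integer ids (a simplification). `pickLeader` is a hypothetical name and this is not the getLeader() implementation.

```cpp
#include <map>
#include <optional>
#include <vector>

// Returns the most frequently nominated id; ties go to the lower id.
// Returns std::nullopt when there are no nominees.
inline std::optional<int> pickLeader(const std::vector<int>& nominees) {
    std::map<int, int> counts; // iterated in ascending id order
    for (int id : nominees) {
        ++counts[id];
    }
    std::optional<int> best;
    int bestCount = 0;
    for (const auto& [id, count] : counts) {
        // Strict '>' keeps the earlier (lower) id when counts are equal.
        if (count > bestCount) {
            best = id;
            bestCount = count;
        }
    }
    return best;
}
```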
* Remove unnecessary new statement.
stream will never be a nullptr.
* Move self->dnsCache out of lambda capture.
Member variables are not captured by default; thus, `host` and `service` are not captured. This somehow compiles successfully, but throws std::bad_alloc or basic_string::_S_create exceptions when we call `host+":"+service` in dnsCache.remove().
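The general C++ rule can be illustrated as follows. `Resolver` and `makeKeyCallback` are hypothetical and this is not the patched code: a lambda reaches members only through `this`, so a callback that may outlive the object should capture explicit copies of what it needs.

```cpp
#include <functional>
#include <string>

// Hypothetical resolver; not the real networking class.
struct Resolver {
    std::string host;
    std::string service;

    // Init-captures copy the two members into the closure, so the callback
    // owns its own strings and does not depend on the Resolver's lifetime.
    std::function<std::string()> makeKeyCallback() const {
        return [host = host, service = service]() { return host + ":" + service; };
    }
};
```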
* Revert unintended change.
* Address comments.