Clients now poll the proxy for the latest global config for a specific
version. The proxy now periodically requests the latest global
configuration data and stores it in memory, enabling it to respond
immediately to clients with the appropriate version.
* Storage server shard management with physical shards.
* Cleanup.
* Resolved comments.
* Added `UnlimintedCommitBytes`.
Co-authored-by: He Liu <heliu@apple.com>
* Shard based move.
* Clean up.
* Clear results on retry in getInitialDataDistribution.
* Remove assertion on SHARD_ENCODE_LOCATION_METADATA for compatibility.
* Resolved comments.
Co-authored-by: He Liu <heliu@apple.com>
* Fix comments
* Add simulation value for SERVER_KNOBS->SNAP_CREATE_MAX_TIMEOUT
* A work version with correctness clean
* Remove unnecessay comments; debugging symbols
* Only check secondary address for coordinators, same as before
* Change the trace to SevError and remove the ASSERT(false)
* Remove TLogSnapRequest handling on TlogServer, which is changed to use WorkerSnapRequest
* Add retry for network failures
* Add retry limit for network failures; still allow duplicate snapshots on processes are both tlog and storage to avoid race
* Add retry limit as a knob and make backoff exponentail
* Add getDatabaseConfiguration(Transaction* tr)
* revert back to send request for each role once
* update some comments
* Add an DD tenant-cache-assembly actor
* Add basic tenant list monitoring for tenant cache.
* Update DD tenant cache refresh to be more efficient and unit-testable
* Remove the DD prefix in the tenant cache class name (and associated impl and UT class names); there is nothing specific to DD in it; DD uses it; other modules may use it in the future
* Disable DD tenant awareness by default
* Don't fail test if log cursor times out during network partition
Also, exercise the codepath for handling timed_out in simulation, by
reverting this knob buggification behavior to that of 07976993e7.
* clang-format
Adding GetEncryptCipherKeys and GetLatestCipherKeys helper actors, which encapsulate cipher key fetch logic: getting cipher keys from local BlobCipherKeyCache, and on cache miss fetch from EKP (encrypt key proxy). These helper actors also handles the case if EKP get shutdown in the middle, they listen on ServerDBInfo to wait for new EKP start and send new request there instead.
The PR also have other misc changes:
* EKP is by default started in simulation regardless of. ENABLE_ENCRYPTION knob, so that in restart tests, if ENABLE_ENCRYPTION is switch from on to off after restart, encrypted data will still be able to be read.
* API tweaks for BlobCipher
* Adding a ENABLE_TLOG_ENCRYPTION knob which will be used in later PRs. The knob should normally be consistent with ENABLE_ENCRYPTION knob, but could be used to disable TLog encryption alone.
This PR is split out from #6942.
* Add simulation test for 1 data hall + 1 machine failure case.
* Disable BUGGIFY for DEGRADED_RESET_INTERVAL.
A simulation test discovered a situation where machines attempting to connect
to a dead coordinator (with a well-known endpoint) were getting themselves
marked degraded. This flapping of the degraded state prevented recovery from
completing, as it started over any time it noticed that tlogs on degraded
hosts could be relocated to non-degraded ones.
bin/fdbserver -r simulation -f tests/rare/CycleWithDeadHall.toml -b on -s 276841956