* Add an internal C API to support memory connection records
* Track shared state in the client using a unique and immutable cluster ID from the cluster
* Add missing code to store the clusterId in the database state object
* Update some arguments to pass by const&
* Defer recoveredDiskFiles wait if Encryption data at-rest is enabled
Description
In the current code ClusterController startup wait for 'recoveredDiskFiles'
future to complete before triggered 'clusterControllerCore' actor, which
inturn starts 'EncryptKeyProxy' (EKP) actor resposible to fetch/refresh
encryption keys needed for ClusterRecovery as well interactions with
KMS.
Patch addresses a circular dependency where StorageServer initialization
depends on EKP, but, CC doesn't recruit EKP till 'recoveredDiskFiles' completes
which includes SS initialization. Given 'recoveredDiskFiles' is an optimization,
the patch proposes deferring the 'recoveredDiskFiles' future completion until
new Master recruitment is done as part of ClusterRecovery (unblock EKP singleton)
Testing
Ran 500K correctness runs: 20220618-055310-ahusain-foundationdb-61c431d467557551
Recorded failures doesn't seems to be related to the change.
Adding GetEncryptCipherKeys and GetLatestCipherKeys helper actors, which encapsulate cipher key fetch logic: getting cipher keys from local BlobCipherKeyCache, and on cache miss fetch from EKP (encrypt key proxy). These helper actors also handles the case if EKP get shutdown in the middle, they listen on ServerDBInfo to wait for new EKP start and send new request there instead.
The PR also have other misc changes:
* EKP is by default started in simulation regardless of. ENABLE_ENCRYPTION knob, so that in restart tests, if ENABLE_ENCRYPTION is switch from on to off after restart, encrypted data will still be able to be read.
* API tweaks for BlobCipher
* Adding a ENABLE_TLOG_ENCRYPTION knob which will be used in later PRs. The knob should normally be consistent with ENABLE_ENCRYPTION knob, but could be used to disable TLog encryption alone.
This PR is split out from #6942.
The special keys `\xff\xff/management/profiling/client_txn_sample_rate`
and `\xff\xff/management/profiling/client_txn_size_limit` are deprecated
in FDB 7.2. However, GlobalConfig was introduced in 7.0, and reading and
writing these keys through the special key space was broken in 7.0+.
This change modifies the profiling special keys to use GlobalConfig
behind the scenes, fixing the broken special keys.
The following Python script was used to make sure both GlobalConfig and
the profiling special key can be used to read/write/clear profiling
data:
```
import fdb
import time
fdb.api_version(710)
@fdb.transactional
def set_sample_rate(tr):
tr.options.set_special_key_space_enable_writes()
# Alternative way to write the key
#tr[b'\xff\xff/global_config/config/fdb_client_info/client_txn_sample_rate'] = fdb.tuple.pack((5.0,))
tr[b'\xff\xff/management/profiling/client_txn_sample_rate'] = '5.0'
@fdb.transactional
def clear_sample_rate(tr):
tr.options.set_special_key_space_enable_writes()
# Alternative way to clear the key
#tr.clear(b'\xff\xff/global_config/config/fdb_client_info/client_txn_sample_rate')
tr[b'\xff\xff/management/profiling/client_txn_sample_rate'] = 'default'
@fdb.transactional
def get_sample_rate(tr):
print(tr.get(b'\xff\xff/global_config/config/fdb_client_info/client_txn_sample_rate'))
# Alternative way to read the key
#print(tr.get(b'\xff\xff/management/profiling/client_txn_sample_rate'))
fdb.options.set_trace_enable()
fdb.options.set_trace_format('json')
db = fdb.open()
get_sample_rate(db) # None (or 'default')
set_sample_rate(db)
time.sleep(1) # Allow time for global config changes to propagate
get_sample_rate(db) # 5.0
clear_sample_rate(db)
time.sleep(1)
get_sample_rate(db) # None (or 'default')
```
It can be run with `PYTHONPATH=./bindings/python/ python profiling.py`,
and reads the `fdb.cluster` file in the current directory.
```
$ PYTHONPATH=./bindings/python/ python sps.py
None
5.000000
None
```
* Add tryResolveHostnames() in connection string.
* Add missing hostname to related interfaces.
* Do not pass RequestStream into *GetReplyFromHostname() functions.
Because we are using new RequestStream for each request anyways. Also, the passed in pointer could be nullptr, which results in seg faults.
* Add dynamic hostname resolve and reconnect intervals.
* Address comments.
* EncryptKeyProxy server APIs for simulation runs.
Description
diff-2: FlowSingleton util class
Bug fixes
diff-1: Expected errors returned to the caller
Major changes proposed are:
1. EncryptKeyProxy server APIs:
1.1. Lookup Cipher details via BaseCipherId
1.2. Lookup latest Cipher details via encryption domainId.
2. EncyrptKeyProxy implements caches indexed by: baseCipherId &
encyrptDomainId
3. Periodic task to refresh domainId indexed cache to support
'limiting cipher lifetime' abilities if supported by
external KMS solutions.
Testing
EncyrptKeyProxyTest workload to validate the newly added code.
Description
Major changes proposed are:
1. Rename ServerKnob->ENABLE_ENCRYPT_KEY_PROXY to
ServerKnob->ENABLE_ENCRYPTION. Approach simplifies enabling
controlling encyrption code change using a single knob (desirable)
2. Implement EncyrptKeyVaultProxy simulated interface to assist
validating encyrption workflows in simulation runs. The interface
is leveraged to satisfy "encryption keys" lookup which otherwise
gets satisfied by integrating organization preferred Encryption
Key Management solution.
Testing
Unit test to validate the newly added code
diff-1: Address review comments
Major changes includes:
1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
responsible to expose APIs to fetch encyrption keys interacting
with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
needed during recovery (decode TLog records), for now the process
is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
subsequent PRs.
NOTE: The code is protected under a SERVER_KNOB with the default
value as 'false' for now.:%s
Major changes includes:
1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
responsible to expose APIs to fetch encyrption keys interacting
with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
needed during recovery (decode TLog records), for now the process
is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
subsequent PRs.
NOTE: The code is protected under a SERVER_KNOB with the default
value as 'false' for now.
Patch updates code documentation to reflect the recent code
refactoring where ClusterController process drives recovery
instead of sequencer/master process.
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""
Major changes includes:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9c9888e203421290959bd7f2c075d7f.
1.b. This reverts commit d174bb2e06bff01157d16c652073536c54d17f7f.
1.c. This reverts commit 30b05b469c87d9b526b427751c211fb5cf7ff9cd.
2. Update Status.actor to track ClusterController interface to track
recovery status.
3. Introduce a ServerKnob to define "cluster recovery trace event"
prefix; for now keeping it as "Master", however, it should allow
smooth transition to "Cluster" prefix as it seems more appropriate.
diff-1: Address Jingyu's review comments
diff-2: Introduce ClusterRecovery actor to seperate out
cluster recovery code
At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
responsible to recruit all other processes as well restore the
cluster state.
Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.
Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
process like other worker processes compared to current scheme
where "sequencer" process gets special treatment. In newer scheme
sequencer is responsible for maintaining/providing
"committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
the sequencer though orchestrating the recovery state machine, it
need to reachout to the ClusterController for recruiting worker
processes etc.
NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.
Next Steps:
Cluster recovery documentation will be updated in near future.
diff-1: Address Jingyu's review comments
At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
responsible to recruit all other processes as well restore the
cluster state.
Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.
Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
process like other worker processes compared to current scheme
where "sequencer" process gets special treatment. In newer scheme
sequencer is responsible for maintaining/providing
"committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
the sequencer though orchestrating the recovery state machine, it
need to reachout to the ClusterController for recruiting worker
processes etc.
NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.
Next Steps:
Cluster recovery documentation will be updated in near future.