While waiting for a newly recruited data distributor or ratekeeper, a previous
one could have already joined, in which case we can skip the unnecessary
recruiting.
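A minimal Flow-style sketch of the check. The names here (ClusterControllerData,
db.ratekeeper, recruitRatekeeper) are illustrative stand-ins, not the exact
patch:

    // Sketch: bail out if a ratekeeper already (re)joined while the
    // cluster controller was waiting to recruit one.
    ACTOR Future<Void> startRatekeeper(ClusterControllerData* self) {
        if (self->db.ratekeeper.present()) {
            return Void(); // a previous ratekeeper already registered
        }
        RatekeeperInterface rkInterf = wait(recruitRatekeeper(self));
        self->db.ratekeeper = rkInterf;
        return Void();
    }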
Revert the change to worker.actor.cpp for ratekeeper. Instead, ratekeeper
recruitment should avoid any process that already hosts one. This fixes a bug
where the ratekeeper interface became a zombie, killing other healthy
ratekeepers while doing no useful work. Found by:
-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
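A sketch of the candidate filter, under the assumption that the worker
bookkeeping looks roughly like this (field names are illustrative):

    // Sketch: never recruit a ratekeeper onto the process that already
    // hosts one, so a zombie interface cannot shadow the new role.
    WorkerInterface pickRatekeeperWorker(ClusterControllerData* self) {
        Optional<NetworkAddress> existing;
        if (self->db.ratekeeper.present())
            existing = self->db.ratekeeper.get().address();
        for (auto& w : self->workers) {
            if (existing.present() && w.interf.address() == existing.get())
                continue; // skip the process hosting the existing ratekeeper
            return w.interf; // first acceptable worker (fitness logic elided)
        }
        throw no_more_servers();
    }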
When a ratekeeper registers, monitorRatekeeper wakes up and recruits a new
ratekeeper. Add a 0s delay to avoid this.
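In Flow, wait(delay(0)) yields to the run loop before continuing, so a
registration delivered in the same event batch is applied before the monitor
decides whether to recruit. A sketch of the intent; only monitorRatekeeper
comes from the text, the rest is illustrative:

    ACTOR Future<Void> monitorRatekeeper(ClusterControllerData* self) {
        loop {
            // Yield once: if a registration just installed a ratekeeper,
            // we must observe it before deciding to recruit another.
            wait(delay(0.0));
            if (!self->db.ratekeeper.present()) {
                wait(startRatekeeper(self));
            }
            // Wait until the current ratekeeper fails, then loop around.
            wait(waitFailureClient(self->db.ratekeeper.get().waitFailure));
            self->db.ratekeeper = Optional<RatekeeperInterface>();
        }
    }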
If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
If DD, RK, and Master all run on the same process and that process fails,
recruitment of a new DD or RK could try to use the old master's worker
interface, which is invalid and causes recruitment to get stuck.
Fix by adding a delay and checking that the master is valid before recruitment.
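A sketch of that guard, assuming a hypothetical validity check and a short
retry delay (the knob name is illustrative):

    // Sketch: do not hand a recruitment request to the master's worker
    // interface until the master is known to be valid again.
    ACTOR Future<Void> waitForValidMaster(ClusterControllerData* self) {
        while (!self->masterIsValid()) { // hypothetical validity check
            wait(delay(SERVER_KNOBS->ATTEMPT_RECRUITMENT_DELAY));
        }
        return Void();
    }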
Avoid multiple concurrent recruitments of ratekeeper with a recruiting flag.
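Extending the startRatekeeper sketch above with an illustrative flag:

    // Sketch: a flag on the cluster controller ensures only one
    // ratekeeper recruitment is in flight at a time.
    ACTOR Future<Void> startRatekeeper(ClusterControllerData* self) {
        if (self->recruitingRatekeeper)
            return Void(); // another recruitment is already running
        self->recruitingRatekeeper = true;
        try {
            RatekeeperInterface rkInterf = wait(recruitRatekeeper(self));
            self->db.ratekeeper = rkInterf;
        } catch (Error& e) {
            self->recruitingRatekeeper = false;
            throw;
        }
        self->recruitingRatekeeper = false;
        return Void();
    }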
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer the master's process in this case.
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on
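A sketch of the preference, where getBestFitnessWorker and isProxyOrResolver
are hypothetical helpers:

    // Sketch: if the fitness-chosen worker is already a proxy or a
    // resolver, recruiting there loops forever; fall back to the
    // master's process instead.
    WorkerInterface chooseRatekeeperWorker(ClusterControllerData* self) {
        WorkerInterface best = getBestFitnessWorker(self);
        if (isProxyOrResolver(self, best))
            return self->masterWorker; // illustrative field
        return best;
    }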
The test has the following event sequence:
- Time 113.3s, CC noticed the DD failure and cleared the DD interface.
- 1s later, the old DD rejoined and registered with CC.
- Time 131.7s, DD actor cancelled. This old DD raced to register with CC, and
the failure monitor was not installed because monitorDataDistributor was
stalled waiting for the new DD.
- Time 161.4s, new DD running. Recruiting the new DD was delayed because no
servers were available during that period.
Fix by disabling DD registration during the recruiting process.
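A sketch of the registration guard, assuming a recruitingDistributor flag like
the ratekeeper one above:

    // Sketch: drop data distributor registrations that race with an
    // in-flight recruitment, so a dying DD cannot re-install itself.
    void registerDataDistributor(ClusterControllerData* self,
                                 const DataDistributorInterface& di) {
        if (self->recruitingDistributor)
            return; // monitorDataDistributor is waiting on the new DD
        self->db.distributor = di;
    }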
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master, and make sure only one RateKeeper is live in the cluster
as well.
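A sketch of the data-center constraint (locality fields are illustrative):

    // Sketch: only consider workers in the master's data center when
    // recruiting the ratekeeper or the data distributor.
    bool inMasterDc(ClusterControllerData* self, const WorkerInterface& w) {
        return w.locality.dcId() == self->masterDcId;
    }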
Since Ratekeeper and DataDistributor no longer run with the Master, they might
be running on stateful processes before a new Master becomes alive, which is
undesirable.
This PR adds monitoring of both Ratekeeper and DataDistributor at the Cluster
Controller -- if the Master runs on a stateless class and RK/DD run on a worse
class, then RK/DD will be killed. I.e., RK/DD should run under their own
process classes or on the same stateless process as the Master. After the
restart, RK/DD should be running on a better process class.
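A sketch of the class comparison, assuming ProcessClass fitness semantics
where a lower value means a better fit:

    // Sketch: halt RK/DD when the master sits on a stateless process and
    // RK/DD sit on a strictly worse class, forcing a re-recruitment.
    bool shouldHaltRatekeeper(ProcessClass masterClass, ProcessClass rkClass) {
        auto masterFit = masterClass.machineClassFitness(ProcessClass::Ratekeeper);
        auto rkFit = rkClass.machineClassFitness(ProcessClass::Ratekeeper);
        return masterClass.classType() == ProcessClass::StatelessClass &&
               rkFit > masterFit; // larger fitness value == worse placement
    }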
Add a new role for ratekeeper.
Remove StorageServerChanges from data distribution; the ratekeeper now
monitors storage servers itself, borrowing the idea from DataDistribution.
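A sketch of the server-list monitor, borrowing the shape of the
DataDistribution code; serverListKeys and decodeServerListValue exist in the
codebase, while the polling interval and the stream type here are illustrative:

    // Sketch: the ratekeeper polls the storage server list itself instead
    // of receiving StorageServerChanges from data distribution.
    ACTOR Future<Void> monitorServerListChange(
            Database db,
            PromiseStream<std::vector<StorageServerInterface>> serverChanges) {
        state Transaction tr(db);
        loop {
            try {
                tr.reset();
                Standalone<RangeResultRef> kvs =
                    wait(tr.getRange(serverListKeys, CLIENT_KNOBS->TOO_MANY));
                std::vector<StorageServerInterface> servers;
                for (auto& kv : kvs)
                    servers.push_back(decodeServerListValue(kv.value));
                serverChanges.send(servers);
                wait(delay(5.0)); // illustrative refresh interval
            } catch (Error& e) {
                wait(tr.onError(e));
            }
        }
    }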
Change the rst document file;
Change the coding style to be consistent with the nearby code;
Ensure we always initialize connectedCoordinatesNum to 0, even when the
variable is not used.
A client will always try to connect to all coordinators. This commit lets
Status track the number of connected coordinators for each client.
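A sketch of the bookkeeping; the Coordinator type and connectivity check are
illustrative, and connectedCoordinatesNum is the variable named above:

    // Sketch: count the coordinators this client is currently connected to.
    int countConnectedCoordinators(const std::vector<Coordinator>& coordinators) {
        int connectedCoordinatesNum = 0; // always starts at 0, used or not
        for (const auto& c : coordinators) {
            if (c.isConnected()) // hypothetical connectivity check
                ++connectedCoordinatesNum;
        }
        return connectedCoordinatesNum;
    }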
This allows us to canary coordinators. For example, when switching from
non-TLS to TLS, we can switch a single coordinator first to check whether
every client is able to connect through TLS. We can then make the non-TLS to
TLS switch for the remaining coordinators one by one, avoiding the risk of
losing connections during the switch.
To understand whether all clients have TLS configured, we check the tlsoption
when a client tries to open a database. This is similar to how we track the
versions of multi-version clients.
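A sketch of the check at database-open time; the struct and field names are
illustrative, the point being that the configured tlsoption is recorded next
to the supported client versions:

    // Sketch: when a client opens the database, record whether it has TLS
    // configured, alongside the multi-version client version tracking.
    void recordClientOpen(ClientStatusInfo& info, const NetworkOptions& opts) {
        info.supportedVersions = opts.supportedVersions; // existing tracking
        info.tlsConfigured = opts.tlsOptionConfigured;   // hypothetical field
    }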