372 Commits

Author SHA1 Message Date
A.J. Beamon
4eb5715689 Add support for a client or worker having multiple issues. 2019-03-22 08:29:41 -07:00
Jingyu Zhou
da338c3ad6 Avoid unnecessary recruiting of DD or RK
While waiting to recruit a data distributor or ratekeeper, a previous one
could have already joined. So we can skip this unnecessary recruiting.

Revert the change of worker.actor.cpp for ratekeeper. Instead, recruiting a
ratekeeper should avoid a process that already hosts one. This fixes a bug
where the ratekeeper interface became a zombie, killing other healthy
ratekeepers but doing no useful work. Found by:

-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
2019-03-21 22:40:07 -07:00
Evan Tschannen
fe4464e786 fix: processClassFitness could be wrong if the client changed their class while rebooting 2019-03-21 17:56:04 -07:00
Jingyu Zhou
299961aecb Move ratekeeper or data distributor from excluded servers 2019-03-21 17:17:33 -07:00
Jingyu Zhou
48324ad4be Fix a race during ratekeeper registration
When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new
ratekeeper. Adding a 0s delay to avoid this.

If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
2019-03-21 12:56:56 -07:00
Evan Tschannen
e692f0f70f fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles 2019-03-21 11:23:49 -07:00
Jingyu Zhou
8edefda193 Fix test stuck due to invalid worker in cluster controller
Test case:
-r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off
2019-03-20 22:24:01 -07:00
Jingyu Zhou
937b6dde31 Fix a race of DD, RK, Master failure
If DD, RK, and Master all run on the same process and that process fails,
recruiting a new DD or RK could try to use the old master worker interface,
which is invalid and causes recruitment to get stuck.

Fix by adding a delay and checking master is valid before recruitment.
2019-03-20 16:19:20 -07:00
Jingyu Zhou
ce5c6d18d2 Fix ratekeeper recruitment bug 2019-03-20 14:22:22 -07:00
Jingyu Zhou
86b687981b Fix ratekeeper and data distributor recruiting bug
Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag.
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer master in this case.
2019-03-20 10:00:31 -07:00
Jingyu Zhou
474abd81bd Move placement monitoring inside doCheckOutstandingRequests 2019-03-19 22:48:21 -07:00
Balachandar Namasivayam
f9560e1abd Addressed Review Comments 2019-03-19 15:23:14 -07:00
Jingyu Zhou
bc6fdaea3e Recruit a new ratekeeper before halting the old 2019-03-19 15:21:46 -07:00
Jingyu Zhou
0fb6a03c07 First round of review comment fixes for PR#1307 2019-03-19 11:29:19 -07:00
Jingyu Zhou
8d609eb51d Protect ratekeeper registration race during recruitment
This is similar to the one for DataDistributor.
2019-03-18 13:53:50 -07:00
Balachandar Namasivayam
5471725db5 Support config where the primary and remote DCs can be used as satellites. 2019-03-18 12:17:59 -07:00
Jingyu Zhou
2b41a97a6e Fix the issue of slow dying Data Distributor
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on

The test has the following event sequence:
- Time 113.3s, CC noticed DD failure, cleared DD interface.
- 1s later, DD rejoined and registered with CC.
- Time 131.7s, DD actor cancelled. This old DD raced to register with CC and
the failure monitor is not installed because monitorDataDistributor is stalled
waiting for new DD.
- Time 161.4s, new DD running. New DD recruiting was delayed due to no servers
in the period.

Fix by disabling DD registration during the recruiting process.
2019-03-17 22:19:23 -07:00
Jingyu Zhou
254c78053c Fix a segfault error
After wait, ServerDBInfo may have changed. Using the old copy is wrong.
2019-03-15 22:11:13 -07:00
Jingyu Zhou
12ddd56698 Fix Ratekeeper and DataDistributor placement
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master. Make sure only one RateKeeper is live in the cluster as
well.
2019-03-15 17:09:28 -07:00
Jingyu Zhou
bb5686eb75 Fix monitoring of DD and RK 2019-03-15 16:02:17 -07:00
Jingyu Zhou
9f6fe5f649 Merge remote-tracking branch 'apple/master' into ratekeeper 2019-03-15 11:30:04 -07:00
Jingyu Zhou
40860e0093 Attempt to fix. 2019-03-15 11:29:04 -07:00
Jingyu Zhou
99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00
Meng Xu
5a10bf5dfc Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-14 10:35:12 -07:00
Evan Tschannen
a2108047aa removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects. 2019-03-13 13:14:39 -07:00
Evan Tschannen
e068c478b5 merge master 2019-03-12 18:31:25 -07:00
Evan Tschannen
5392742902 fixed review comments 2019-03-12 14:38:54 -07:00
Jingyu Zhou
2b0139670e Fix review comment for PR 1176 2019-03-12 12:02:30 -07:00
Meng Xu
46f4b02807 TLS Status: Resolve review comments
Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller;
Set connectedCoordinatorsNum to null by default for monitorLeader()
2019-03-11 17:10:08 -07:00
Evan Tschannen
1be9ae5ce3 fixed merge conflict 2019-03-08 22:51:06 -05:00
Evan Tschannen
044b6b4f8a Merge branch 'master' into feature-degraded-tlog
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-08 22:50:41 -05:00
Evan Tschannen
45fe6b369b tlog recruitment will prefer non-degraded processes, however it will not choose fewer than the desired number of tlogs to avoid degraded processes
better master exists will switch the master to avoid degraded processes
2019-03-08 14:40:00 -05:00
Evan Tschannen
710a64dc4e replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails 2019-03-08 11:25:07 -05:00
Jingyu Zhou
517966fce2 Remove lastLimited from rate keeper
Refactor code to make IDE happy.
2019-03-07 13:16:20 -08:00
Jingyu Zhou
36a51a7b57 Fix a segfault bug due to uncopied ratekeeper interface 2019-03-07 13:16:20 -08:00
Jingyu Zhou
e6ac3f7fe8 Minor fix on ratekeeper work registration. 2019-03-07 13:16:20 -08:00
Jingyu Zhou
3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Meng Xu
04880e3d4d Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-06 13:41:16 -08:00
Alex Miller
c6a65389ae Remove noexcept macro and replace with BOOST_NOEXCEPT.
BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a
way that is correctly maintained over time.
2019-03-05 22:06:12 -08:00
Meng Xu
820548223a Status: connected_coordinators misc minor changes
Change the rst document file;
Change the coding style to be consistent with the nearby code;
Ensure we always initialize the connectedCoordinatesNum to 0
even when the variable is not used.
2019-03-05 21:45:18 -08:00
Meng Xu
b7a52e81e2 Status: Count connected coordinators per client
A client will always try to connect to all coordinators.
This commit lets Status track the number of connected coordinators
for each client.

This allows us to do canary testing on coordinators. For example,
when we switch from non-TLS to TLS, we can switch 1 coordinator
from non-TLS to TLS. This can help check if a client has the ability
to connect through TLS.
We can make the non-TLS to TLS switch for each coordinator
one by one. This avoids the risk of losing connection during the switch.
2019-03-05 21:21:23 -08:00
Meng Xu
c0535c49bb Status: TLS client status
Use ClientStatusInfo structure for each network address (client),
instead of passing each status info as a parameter.
2019-03-04 16:35:10 -08:00
Meng Xu
94385447bc Status: Get if client configured TLS
To understand if all clients have configured TLS,
we check the tlsoption when a client tries to open a database.
This is similar to how we track the versions of multi-version clients.
2019-03-01 15:17:01 -08:00
Evan Tschannen
b8910ba7cd Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.h
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Evan Tschannen
d4737fac0f knobify force recovery recovery check delay 2019-02-19 16:05:20 -08:00
mpilman
3f0fd2a20c Use fwd decls in WorkerInterface
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
27a3153719 Use ACTOR forward declarations in MoveKeys
Also MoveKeys.h -> MoveKeys.actor.h
2019-02-19 15:16:59 -08:00
mpilman
3a0f9839b9 Fix minor IDE build errors 2019-02-19 15:16:59 -08:00
mpilman
0bb60e5a3b Use proper fwd decl in NativeAPI
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
Evan Tschannen
ed9e20ce17 forgot to fix merge conflicts 2019-02-18 17:09:55 -08:00