249 Commits

Author SHA1 Message Date
Andrew Noyes
6207d724f8 Fix all -Wunused-variable warnings 2019-04-15 18:13:00 -07:00
mpilman
1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
mpilman
c008e16c81 Defer formatting in traces to make them cheaper
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:

- TraceEvent::detail now takes a c-string instead of std::string for
  literals. This prevents unnecessary allocations if the trace is not
  going to be printed in the first place (for example for SevDebug).
  Before that, `detail` expected a `std::string` as key, which meant that
  any string literal would be copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
  specialized for any type that needs to be printed. The actual
  formatting will be deferred to after the `enabled` check. This
  provides two benefits: (1) if a TraceEvent is disabled, we don't pay
  for the formatting and (2) TraceEvent can trace types that it doesn't
  know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
  passed. This will make sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` will be inlined. So for disabled TraceEvent
  calls, a call to detail will only introduce an if-branch, which is much
  cheaper than a function call.
2019-04-05 13:12:19 -07:00
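The deferral described in the commit message above can be illustrated with a minimal sketch. This is not the FoundationDB implementation: `MiniTraceEvent`, the `Traceable<int>` specialization, the `formatCalls` counter, and the severity values are all hypothetical names chosen for illustration. The sketch only assumes what the message states: `detail` takes a C-string key, is inlined, and formats the value through a specializable `Traceable` template after the `enabled` check.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical counter so the example can observe how often formatting runs.
inline int& formatCalls() {
    static int calls = 0;
    return calls;
}

// Primary template; specialize it for every type that should be traceable.
template <class T>
struct Traceable;

template <>
struct Traceable<int> {
    static std::string toString(int value) {
        ++formatCalls();
        return std::to_string(value);
    }
};

// Illustrative severity values, not the real ones.
enum Severity { SevDebug = 5, SevInfo = 10 };

class MiniTraceEvent {
public:
    // The enabled flag is decided once, in the constructor.
    MiniTraceEvent(Severity sev, Severity minSev) : enabled(sev >= minSev) {}

    // Defined in the class body, so implicitly inline. The key is a C-string,
    // so a string literal causes no allocation when the event is disabled.
    template <class T>
    MiniTraceEvent& detail(const char* key, const T& value) {
        if (enabled) {
            // Formatting happens only after the enabled check.
            fields.emplace_back(key, Traceable<T>::toString(value));
        }
        return *this;
    }

    bool enabled;
    std::vector<std::pair<std::string, std::string>> fields;
};

// Usage: a disabled event never formats, an enabled one formats once per detail.
inline void exampleDeferredFormatting() {
    formatCalls() = 0;
    MiniTraceEvent disabledEvent(SevDebug, SevInfo);
    disabledEvent.detail("Count", 42);
    assert(formatCalls() == 0);

    MiniTraceEvent enabledEvent(SevInfo, SevInfo);
    enabledEvent.detail("Count", 42);
    assert(formatCalls() == 1);
}
```

In this sketch a disabled event pays only for the branch inside `detail`, which is the saving the commit message describes.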
Evan Tschannen
8ebf771392 cleanup cluster controller trace events 2019-03-30 14:17:18 -07:00
A.J. Beamon
71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen
5e03e178de
Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues
Add support for a client or worker having multiple issues.
2019-03-24 17:27:50 -07:00
Evan Tschannen
d45159ebf7
Merge pull request #1307 from jzhou77/ratekeeper
Monitor placement of Ratekeeper and DataDistributor
2019-03-24 17:26:07 -07:00
Evan Tschannen
d6ad027d37 ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one 2019-03-24 16:48:24 -07:00
Evan Tschannen
f426d732ea fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper 2019-03-24 16:04:59 -07:00
Evan Tschannen
e8948726e8 once we recruit a ratekeeper, do not allow any other ratekeepers to register 2019-03-24 11:04:39 -07:00
Jingyu Zhou
40eec20252 Restore master PID in worker registration
This fix was lost during the merge.
2019-03-23 21:02:11 -07:00
Jingyu Zhou
3ef26e6be3 Fix fitness assignment statements
Found by MacOS build.
2019-03-23 19:16:04 -07:00
Evan Tschannen
1fc6937802 changed NetworkAddressList to at most two addresses for performance 2019-03-23 17:54:46 -07:00
Evan Tschannen
b51a24453e the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles
excluded data distributor and ratekeeper were improperly killed when the best option was also excluded
2019-03-23 13:25:36 -07:00
Jingyu Zhou
fdc5b5ddbf Fix: spurious ratekeeper registration
A rare race condition:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on

- A is the ratekeeper.
- CC recruits B and B starts
- CC halts ratekeeper A and A is halted
- A registers back with CC, which then halts B. CC sets A to be the ratekeeper.

CC starts recruiting and finds A is the best machine. But skips recruiting
because CC thinks A is already used. Now the cluster is left with no ratekeeper.

Fix by disallowing ratekeeper registration with previous ID.
2019-03-23 11:03:51 -07:00
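The fix described in the commit message above (rejecting a registration that carries the previous ratekeeper's ID) can be sketched roughly as follows. This is a heavily simplified model, not the code in ClusterController.actor.cpp: `MiniController`, its fields, and both methods are hypothetical, IDs are plain strings, and only the final step of the race (the halted instance registering again) is modeled.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the cluster controller's ratekeeper bookkeeping.
struct MiniController {
    std::string currentRatekeeperId;   // empty means "no ratekeeper"
    std::string previousRatekeeperId;  // ID of the ratekeeper halted most recently

    // Halting remembers the old ID so a late re-registration can be recognized.
    void haltCurrent() {
        previousRatekeeperId = currentRatekeeperId;
        currentRatekeeperId.clear();
    }

    // Returns true if the registration is accepted.
    bool registerRatekeeper(const std::string& id) {
        if (!previousRatekeeperId.empty() && id == previousRatekeeperId) {
            return false;  // stale instance registering back after being halted
        }
        if (!currentRatekeeperId.empty()) {
            return false;  // a ratekeeper is already active
        }
        currentRatekeeperId = id;
        return true;
    }
};

// Usage: A is halted, B takes over, and A's late registration is refused.
inline void exampleStaleRegistration() {
    MiniController cc;
    assert(cc.registerRatekeeper("A"));
    cc.haltCurrent();
    assert(cc.registerRatekeeper("B"));
    assert(!cc.registerRatekeeper("A"));
    assert(cc.currentRatekeeperId == "B");
}
```

Under these assumptions the halted instance can no longer displace its replacement, so the controller is not left believing that a halted process still holds the role.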
Jingyu Zhou
6523cd4931 Fix: recruit ratekeeper is not triggered 2019-03-23 09:20:54 -07:00
Evan Tschannen
2da46e3172 fix: halt if datacenters are different 2019-03-22 23:53:21 -07:00
Evan Tschannen
d34c56c9a5 ensure that the processId exists in id_worker before accessing it 2019-03-22 18:54:39 -07:00
Evan Tschannen
36ab852bb1 Merge branch 'master' into ratekeeper
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen
ddb6058770 simplified ratekeeper monitoring loop 2019-03-22 18:22:45 -07:00
Jingyu Zhou
12917d8c7d Add actors to store halt request futures
Address best fitness in checking better DD or RK.
2019-03-22 18:06:38 -07:00
Jingyu Zhou
e8977aeb98 Remove clusterControllerDcId check
This is no longer needed since it'll be set in the ctor.
2019-03-22 18:01:54 -07:00
Evan Tschannen
82bc447e29 startRatekeeper is responsible for updating serverDBInfo 2019-03-22 17:56:16 -07:00
Evan Tschannen
82c80c225d make sure id_worker is updated before setting ratekeeper or data distribution 2019-03-22 17:08:54 -07:00
Evan Tschannen
6a9c9d79cc
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:58 -07:00
Evan Tschannen
70b1c88cdd
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:52 -07:00
Jingyu Zhou
16f54577ee Restore master PID in cluster controller worker registration
CC may think master failed and clear the master PID, which can block both data
distributor and ratekeeper recruitment. Fix by restoring it during worker
registration.
2019-03-22 14:53:05 -07:00
A.J. Beamon
4eb5715689 Add support for a client or worker having multiple issues. 2019-03-22 08:29:41 -07:00
Jingyu Zhou
da338c3ad6 Avoid unnecessary recruiting of DD or RK
While waiting to recruit a data distributor or ratekeeper, a previous one
could have already joined. So we can skip this unnecessary recruiting.

Revert the change of worker.actor.cpp for ratekeeper. Instead, recruiting
ratekeeper should avoid the process with an existing one. This fixes a bug
where the ratekeeper interface became a zombie, killing other healthy ratekeepers
but doing no useful work. Found by:

-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
2019-03-21 22:40:07 -07:00
Evan Tschannen
fe4464e786 fix: processClassFitness could be wrong if the client changed their class while rebooting 2019-03-21 17:56:04 -07:00
Jingyu Zhou
299961aecb Move ratekeeper or data distributor from excluded servers 2019-03-21 17:17:33 -07:00
Jingyu Zhou
48324ad4be Fix a race during ratekeeper registration
When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new
ratekeeper. Adding a 0s delay to avoid this.

If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
2019-03-21 12:56:56 -07:00
Evan Tschannen
e692f0f70f fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles 2019-03-21 11:23:49 -07:00
Jingyu Zhou
8edefda193 Fix test stuck due to invalid worker in cluster controller
Test case:
-r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off
2019-03-20 22:24:01 -07:00
Jingyu Zhou
937b6dde31 Fix a race of DD, RK, Master failure
If DD, RK, and Master all run on the same process and it fails, recruiting a new
DD or RK could try to use the old master worker interface, which is invalid
and causes recruitment to get stuck.

Fix by adding a delay and checking master is valid before recruitment.
2019-03-20 16:19:20 -07:00
Jingyu Zhou
ce5c6d18d2 Fix ratekeeper recruitment bug 2019-03-20 14:22:22 -07:00
Jingyu Zhou
86b687981b Fix ratekeeper and data distributor recruiting bug
Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag.
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer master in this case.
2019-03-20 10:00:31 -07:00
Jingyu Zhou
474abd81bd Move placement monitoring inside doCheckOutstandingRequests 2019-03-19 22:48:21 -07:00
Balachandar Namasivayam
f9560e1abd Addressed Review Comments 2019-03-19 15:23:14 -07:00
Jingyu Zhou
bc6fdaea3e Recruit a new ratekeeper before halting the old 2019-03-19 15:21:46 -07:00
Jingyu Zhou
0fb6a03c07 First round of review comment fixes for PR#1307 2019-03-19 11:29:19 -07:00
Jingyu Zhou
8d609eb51d Protect ratekeeper registration race during recruitment
This is similar to the one for DataDistributor.
2019-03-18 13:53:50 -07:00
Balachandar Namasivayam
5471725db5 Support config where the primary and remote DC's can be used as satellites. 2019-03-18 12:17:59 -07:00
Jingyu Zhou
2b41a97a6e Fix the issue of slow dying Data Distributor
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on

The test has the following event sequence:
- Time 113.3s, CC noticed DD failure, cleared DD interface.
- 1s later, DD rejoined and registered with CC.
- Time 131.7s, DD actor cancelled. This old DD raced to register with CC and
the failure monitor is not installed because monitorDataDistributor is stalled
waiting for new DD.
- Time 161.4s, new DD running. New DD recruiting was delayed due to no servers
in the period.

Fix by disabling DD registration during the recruiting process.
2019-03-17 22:19:23 -07:00
Jingyu Zhou
254c78053c Fix a segfault error
After wait, ServerDBInfo may have changed. Using the old copy is wrong.
2019-03-15 22:11:13 -07:00
Jingyu Zhou
12ddd56698 Fix Ratekeeper and DataDistributor placement
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master. Make sure only one RateKeeper is live in the cluster as
well.
2019-03-15 17:09:28 -07:00
Jingyu Zhou
bb5686eb75 Fix monitoring of DD and RK 2019-03-15 16:02:17 -07:00
Jingyu Zhou
9f6fe5f649 Merge remote-tracking branch 'apple/master' into ratekeeper 2019-03-15 11:30:04 -07:00
Jingyu Zhou
40860e0093 Attempt to fix. 2019-03-15 11:29:04 -07:00
Jingyu Zhou
99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00