This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:
- TraceEvent::detail now takes a c-string instead of std::string for
  literals. This prevents unnecessary allocations if the trace is not
  going to be printed in the first place (for example for SevDebug).
  Previously `detail` expected a `std::string` as the key, which meant
  that any string literal was copied on each call.
- New templates Traceable and SpecialTraceMetricType. These templates
  can be specialized for any type that needs to be printed, and the
  actual formatting is deferred until after the `enabled` check (see the
  sketch after this list). This provides two benefits: (1) if a
  TraceEvent is disabled, we don't pay for the formatting, and (2)
  TraceEvent can trace types that it doesn't know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
  passed, which makes sure that `TraceEvent::init` is not called for
  disabled events.
- `TraceEvent::detail` will be inlined, so for disabled TraceEvents a
  call to detail only introduces an if-branch, which is much cheaper
  than a function call.
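
A minimal sketch of the deferred-formatting pattern, assuming a
simplified event type (the Traceable name mirrors the real template, but
the bodies here are illustrative, not the actual FoundationDB code):

    #include <cstdio>
    #include <string>
    #include <type_traits>

    // Sketch: Traceable is specialized per type; toString runs only
    // after the enabled check, so disabled events never format.
    template <class T, class Enable = void>
    struct Traceable : std::false_type {};

    struct ShardRange { int begin, end; };   // example user type

    template <>
    struct Traceable<ShardRange> : std::true_type {
        static std::string toString(const ShardRange& r) {
            return std::to_string(r.begin) + "-" + std::to_string(r.end);
        }
    };

    // Hypothetical simplified event: detail() is inlined, so a
    // disabled event costs one branch, with no formatting or allocation.
    struct Event {
        bool enabled;
        void write(const char* key, const std::string& value) {
            std::printf("%s=%s ", key, value.c_str());
        }
        template <class T>
        Event& detail(const char* key, const T& value) {
            if (enabled)                                   // the only cost
                write(key, Traceable<T>::toString(value)); // deferred work
            return *this;
        }
    };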
A rare race condition:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on
- A is the ratekeeper.
- CC recruits B, and B starts.
- CC halts ratekeeper A, and A is halted.
- A registers back with CC, which then halts B and sets A to be the ratekeeper.
CC starts recruiting and finds that A is the best machine, but skips recruiting
because CC thinks A is already in use. Now the cluster is left with no ratekeeper.
Fix by disallowing ratekeeper registration with a previous ID, as sketched below.
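
A hypothetical sketch of the fix (the state layout and names are
assumptions for illustration; the real cluster controller code differs):

    #include <cstdint>
    #include <optional>
    #include <set>

    using UID = uint64_t; // stand-in for FDB's UID type

    struct RatekeeperRegistry {
        std::optional<UID> ratekeeperID; // the ratekeeper CC trusts
        std::set<UID> halted;            // IDs CC has already halted

        // A re-registration carrying a previously halted ID is refused,
        // so a halted ratekeeper cannot displace its successor.
        bool tryRegister(UID id) {
            if (halted.count(id))
                return false;
            ratekeeperID = id;
            return true;
        }
    };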
CC may think the master failed and clear the master PID, which can block both
data distributor and ratekeeper recruitment. Fix by restoring the PID during
worker registration.
While waiting to recruit a data distributor or ratekeeper, a previous one may
have already joined, in which case the unnecessary recruiting can be skipped
(see the check sketched below).
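
Illustratively (names hypothetical), the recruitment path re-checks for
an existing ratekeeper after waiting:

    #include <cstdint>
    #include <optional>

    using UID = uint64_t;                  // stand-in for FDB's UID type

    struct CCState {
        std::optional<UID> ratekeeperID;   // set when one registers
    };

    void startRecruitment(CCState&);       // hypothetical helper

    // After the wait, a previous ratekeeper may have joined already;
    // in that case the recruitment is skipped entirely.
    void maybeRecruitRatekeeper(CCState& self) {
        if (self.ratekeeperID.has_value())
            return;                        // one already joined; skip
        startRecruitment(self);
    }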
Revert the change to worker.actor.cpp for ratekeeper. Instead, ratekeeper
recruitment should avoid any process that already has one. This fixes a bug
where the ratekeeper interface became a zombie, killing other healthy
ratekeepers while doing no useful work. Found by:
-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
When a ratekeeper registers, monitorRatekeeper wakes up and recruits a new
ratekeeper. Add a 0s delay to avoid this (illustrated below).
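
A hedged illustration with a generic task queue (flow's delay(0) yields
in a similar way, letting the already-queued registration be processed
before the recruitment decision):

    #include <functional>
    #include <queue>

    std::queue<std::function<void()>> tasks;

    void post(std::function<void()> fn) { tasks.push(std::move(fn)); }

    int main() {
        bool registered = false;
        // The ratekeeper's registration is already queued...
        post([&] { registered = true; });
        // ...so deferring the monitor's decision by a zero-length delay
        // lets it observe the registration and skip recruiting.
        post([&] { if (!registered) { /* would recruit here */ } });
        while (!tasks.empty()) {
            auto f = std::move(tasks.front());
            tasks.pop();
            f();
        }
    }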
If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
If DD, RK, and Master all run on the same process and that process fails,
recruiting a new DD or RK could try to use the old master's worker interface,
which is invalid and causes recruitment to get stuck.
Fix by adding a delay and checking that the master is valid before recruitment
(sketched below).
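
Sketch of the shape of the fix (the delay length and field names are
assumptions for illustration):

    #include <cstdint>
    #include <optional>

    using UID = uint64_t;            // stand-in for FDB's UID type

    struct MasterInfo {
        UID id;
        bool valid;                  // false once its process has failed
    };

    void waitSeconds(double);        // stand-in for flow's delay()

    // Delay, then confirm the master interface is still valid before
    // using its worker interface for DD/RK recruitment, instead of
    // reusing a stale interface from the failed process.
    bool masterUsableForRecruitment(const std::optional<MasterInfo>& master) {
        waitSeconds(1.0);            // illustrative delay
        return master.has_value() && master->valid;
    }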
Avoid multiple concurrent ratekeeper recruitments with a recruiting flag
(see the sketch below).
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer the master's process in this case.
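
A sketch of the recruiting flag as an RAII guard (the flag and function
names are hypothetical):

    // Set the flag for the duration of one recruitment so concurrent
    // wakeups of the monitor cannot start a second one.
    struct RecruitGuard {
        bool& flag;
        explicit RecruitGuard(bool& f) : flag(f) { flag = true; }
        ~RecruitGuard() { flag = false; }
    };

    bool recruitingRatekeeper = false;

    void maybeRecruit() {
        if (recruitingRatekeeper)
            return;                            // already in flight
        RecruitGuard guard(recruitingRatekeeper);
        // ... pick a worker; if the best candidate is a proxy or a
        // resolver, prefer the master's process to avoid endless
        // re-recruiting ...
    }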
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on
The test has the following event sequence:
- Time 113.3s, CC noticed the DD failure and cleared the DD interface.
- 1s later, DD rejoined and registered with CC.
- Time 131.7s, the DD actor was cancelled. This old DD raced to register with
  CC, and the failure monitor was not installed because monitorDataDistributor
  was stalled waiting for the new DD.
- Time 161.4s, the new DD was running. Its recruiting had been delayed because
  no servers were available during that period.
Fix by disabling DD registration during the recruiting process (sketched below).
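
Hypothetical sketch of the registration gate:

    #include <cstdint>
    #include <optional>

    using UID = uint64_t;                   // stand-in for FDB's UID type

    struct DDState {
        bool recruiting = false;            // true while recruitment runs
        std::optional<UID> distributorID;

        // While a recruitment is in flight, registrations are refused,
        // so a cancelled old DD cannot race in between the failure
        // detection and the new DD's arrival.
        bool tryRegister(UID id) {
            if (recruiting)
                return false;
            distributorID = id;
            return true;
        }
    };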
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master (see the candidate filter sketched below), and make sure
only one RateKeeper is live in the cluster.
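
An illustrative candidate filter (types simplified; the real code works
on worker fitness, not a plain list):

    #include <string>
    #include <vector>

    struct Worker {
        std::string dcId;
        bool alive;
    };

    // Only workers in the Master's data center are considered when
    // placing RateKeeper and DataDistributor.
    std::vector<Worker> candidates(const std::vector<Worker>& workers,
                                   const std::string& masterDcId) {
        std::vector<Worker> out;
        for (const auto& w : workers)
            if (w.alive && w.dcId == masterDcId)
                out.push_back(w);
        return out;
    }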
Since Ratekeeper and DataDistributor no longer run with the Master, they might
run on stateful processes before a new Master becomes alive, which is
undesirable.
This PR adds monitoring of both Ratekeeper and DataDistributor at the Cluster
Controller -- if the Master runs on a stateless class and RK/DD run at a worse
class, then RK/DD will be killed. I.e., RK/DD should run at their own classes
or on the same stateless process as the Master. After restart, RK/DD should be
running at a better process class.
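
A sketch of the class check (the fitness ordering here is an assumption
for illustration; FDB's real fitness enum differs):

    // Smaller is better; only an ordering is needed for the check.
    enum class Fitness { Best = 0, Good = 1, Okay = 2, Bad = 3 };

    // Kill RK/DD only when it is placed strictly worse than the
    // Master's stateless process; equal-or-better placement stays.
    bool shouldRestart(Fitness masterFitness, Fitness roleFitness) {
        return roleFitness > masterFitness;
    }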