foundationdb

mirror of https://github.com/apple/foundationdb.git synced 2025-05-16 19:02:20 +08:00

Author	SHA1	Message	Date
Jingyu Zhou	937b6dde31	Fix a race of DD, RK, Master failure If all DD, RK, Master run on the same process and failed. Recruiting of new DD or RK could try to use the old master worker interface, which is an invalid one and causes recruitment to be stuck. Fix by adding a delay and checking master is valid before recruitment.	2019-03-20 16:19:20 -07:00
Jingyu Zhou	ce5c6d18d2	Fix ratekeeper recruitment bug	2019-03-20 14:22:22 -07:00
Jingyu Zhou	86b687981b	Fix ratekeeper and data distributor recruiting bug Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag. Fix endless recruiting when the chosen worker is a proxy or a resolver -- prefer master in this case.	2019-03-20 10:00:31 -07:00
Jingyu Zhou	474abd81bd	Move placement monitoring inside doCheckOutstandingRequests	2019-03-19 22:48:21 -07:00
Balachandar Namasivayam	f9560e1abd	Addressed Review Comments	2019-03-19 15:23:14 -07:00
Jingyu Zhou	bc6fdaea3e	Recruit a new ratekeeper before halting the old	2019-03-19 15:21:46 -07:00
Jingyu Zhou	0fb6a03c07	First round of review comment fixes for PR#1307	2019-03-19 11:29:19 -07:00
Jingyu Zhou	8d609eb51d	Protect ratekeeper registration race during recruitment This is similar one to DataDistributor.	2019-03-18 13:53:50 -07:00
Balachandar Namasivayam	5471725db5	Support config where the primary and remote DC's can be used as satellites.	2019-03-18 12:17:59 -07:00
Jingyu Zhou	2b41a97a6e	Fix the issue of slow dying Data Distributor Test with: -r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on The test has the following event sequence: - Time 113.3s, CC noticed DD failure, cleared DD interface. - 1s later, DD rejoined and registered with CC. - Time 131.7s, DD actor cancelled. This old DD raced to register with CC and the failure monitor is not installed because monitorDataDistributor is stalled waiting for new DD. - Time 161.4s, new DD running. New DD recruiting was delayed due to no servers in the period. Fix by disabling DD registration during the recruiting process.	2019-03-17 22:19:23 -07:00
Jingyu Zhou	254c78053c	Fix a segfault error After wait, ServerDBInfo may have changed. Using the old copy is wrong.	2019-03-15 22:11:13 -07:00
Jingyu Zhou	12ddd56698	Fix Ratekeeper and DataDistributor placement Make sure both RateKeeper and DataDistributor are placed in the same data center as the Master. Make sure only one RateKeeper is live in the cluster as well.	2019-03-15 17:09:28 -07:00
Jingyu Zhou	bb5686eb75	Fix monitoring of DD and RK	2019-03-15 16:02:17 -07:00
Jingyu Zhou	9f6fe5f649	Merge remote-tracking branch 'apple/master' into ratekeeper	2019-03-15 11:30:04 -07:00
Jingyu Zhou	40860e0093	Attempt to fix.	2019-03-15 11:29:04 -07:00
Jingyu Zhou	99d521ef4f	Monitor Ratekeeper and DataDistributor to use stateless processes Since Ratekeeper and DataDistributor are no longer running with Master, they might be running with stateful processes before a new Master becomes alive, which is undesirable. This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster Controller -- if Master runs on a stateless class and RK/DD runs at a worse class, then RK/DD will be killed. I.e., RK/DD should be running at their own classes or on the same stateless process as Master. After restart, RK/DD should be running at a better process class.	2019-03-14 15:00:57 -07:00
Meng Xu	5a10bf5dfc	Merge branch 'master' into mengxu/tls-switch-status-PR	2019-03-14 10:35:12 -07:00
Evan Tschannen	a2108047aa	removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects.	2019-03-13 13:14:39 -07:00
Evan Tschannen	e068c478b5	merge master	2019-03-12 18:31:25 -07:00
Evan Tschannen	5392742902	fixed review comments	2019-03-12 14:38:54 -07:00
Jingyu Zhou	2b0139670e	Fix review comment for PR 1176	2019-03-12 12:02:30 -07:00
Meng Xu	46f4b02807	TLS Status: Resolve review comments Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller; Set connectedCoordinatorsNum to null by default for monitorLeader()	2019-03-11 17:10:08 -07:00
Evan Tschannen	1be9ae5ce3	fixed merge conflict	2019-03-08 22:51:06 -05:00
Evan Tschannen	044b6b4f8a	Merge branch 'master' into feature-degraded-tlog # Conflicts: # fdbserver/ClusterController.actor.cpp	2019-03-08 22:50:41 -05:00
Evan Tschannen	45fe6b369b	tlog recruitment will prefer non-degraded processes, however it will not choose less than desired number of tlogs to avoid degraded processes; better master exists will switch the master to avoid degraded processes	2019-03-08 14:40:00 -05:00
Evan Tschannen	710a64dc4e	replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails	2019-03-08 11:25:07 -05:00
Jingyu Zhou	517966fce2	Remove lastLimited from rate keeper Refactor code to make IDE happy.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	36a51a7b57	Fix a segfault bug due to uncopied ratekeeper interface	2019-03-07 13:16:20 -08:00
Jingyu Zhou	e6ac3f7fe8	Minor fix on ratekeeper work registration.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
Meng Xu	04880e3d4d	Merge branch 'master' into mengxu/tls-switch-status-PR	2019-03-06 13:41:16 -08:00
Alex Miller	c6a65389ae	Remove noexcept macro and replace with BOOST_NOEXCEPT. BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a way that is correctly maintained over time.	2019-03-05 22:06:12 -08:00
Meng Xu	820548223a	Status: connected_coordinators misc minor changes Change the rst document file; Change the coding style to be consistent with the nearby code; Ensure we always initialize the connectedCoordinatesNum to 0 even when the variable is not used.	2019-03-05 21:45:18 -08:00
Meng Xu	b7a52e81e2	Status: Count connected coordinators per client A client will always try to connect to all coordinators. This commit lets Status track the number of connected coordinators for each client. This allows us to do canary in coordinators. For example, when we switch from non-TLS to TLS, we can switch 1 coordinator from non-TLS to TLS. This can help check if a client has the ability to connect through TLS. We can make the non-TLS to TLS switch for each coordinator one by one. This avoids the risk of losing connection in the switch.	2019-03-05 21:21:23 -08:00
Meng Xu	c0535c49bb	Status: TLS client status Use ClientStatusInfo structure for each network address (client), instead of passing each status info as a parameter.	2019-03-04 16:35:10 -08:00
Meng Xu	94385447bc	Status: Get if client configured TLS To understand if all clients have configured TLS, we check the tlsoption when a client tries to open database. This is similar to how we track the versions of multi-version clients.	2019-03-01 15:17:01 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	d4737fac0f	knobify force recovery recovery check delay	2019-02-19 16:05:20 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	27a3153719	Use ACTOR forward declarations in MoveKeys Also MoveKeys.h -> MoveKeys.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3a0f9839b9	Fix minor IDE build errors	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
Evan Tschannen	ed9e20ce17	forgot to fix merge conflicts	2019-02-18 17:09:55 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	8f2af8bed1	fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe for the client to re-send forced recovery commands to the cluster controller. fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date. fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited	2019-02-18 14:54:28 -08:00
Vishesh Yadav	e05b53d755	Merge remote-tracking branch 'apple/master' into task/tls-upgrade	2019-02-15 20:37:07 -08:00
Jingyu Zhou	5e6577cc82	Final cleanup per review comments Make distributor interface optional in ServerDBInfo and many other small changes.	2019-02-14 16:37:17 -08:00
Evan Tschannen	171a69c810	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Evan Tschannen	a4b2c9ef88	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0e47912192	Fix an out-of-memory error	2019-02-14 16:37:16 -08:00

615 Commits