185 Commits

Author SHA1 Message Date
Meng Xu
04880e3d4d Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-06 13:41:16 -08:00
Alex Miller
c6a65389ae Remove noexcept macro and replace with BOOST_NOEXCEPT.
BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a
way that is correctly maintained over time.
2019-03-05 22:06:12 -08:00
Meng Xu
820548223a Status: connected_coordinators misc minor changes
Change the rst document file;
Change the coding style to be consistent with the nearby code;
Ensure we always initilize the connectedCoordinatesNum to 0
even when the variable is not used.
2019-03-05 21:45:18 -08:00
Meng Xu
b7a52e81e2 Status: Count connected coordinators per client
A client will always try to connect all coordinators.
This commit let Status track the number of connected coordinators
for each client.

This allows us to do canary in coordinators. For example,
when we switch from non-TLS to TLS, we can switch 1 coordinator
from non-TLS to TLS. This can help check if a client has the ability
to connect through TLS.
We can make the non-TLS to TLS switch for each coordinators
one by one. This avoid the risk of losing connection in the switch.
2019-03-05 21:21:23 -08:00
Meng Xu
c0535c49bb Status: TLS client status
Use ClientStatusInfo structure for each network address (client),
instead of passing each status info as a parameter.
2019-03-04 16:35:10 -08:00
Meng Xu
94385447bc Status: Get if client configured TLS
To understand if all clients have configured TLS,
we check the tlsoption when a client tries to open database.
This is similar to how we track the versions of multi-version clients.
2019-03-01 15:17:01 -08:00
Evan Tschannen
b8910ba7cd Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.h
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Evan Tschannen
d4737fac0f knobify force recovery recovery check delay 2019-02-19 16:05:20 -08:00
mpilman
3f0fd2a20c Use fwd decls in WorkerInterface
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
27a3153719 Use ACTOR forward declarations in MoveKeys
Also MoveKeys.h -> MoveKeys.actor.h
2019-02-19 15:16:59 -08:00
mpilman
3a0f9839b9 Fix minor IDE build errors 2019-02-19 15:16:59 -08:00
mpilman
0bb60e5a3b Use proper fwd decl in NativeAPI
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
Evan Tschannen
ed9e20ce17 forgot to fix merge conflicts 2019-02-18 17:09:55 -08:00
Evan Tschannen
065a45e05f Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen
8f2af8bed1 fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe to the client to re-send forced recovery commands to the cluster controller.
fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date.

fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited
2019-02-18 14:54:28 -08:00
Vishesh Yadav
e05b53d755 Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-15 20:37:07 -08:00
Jingyu Zhou
5e6577cc82 Final cleanup per review comments
Make distributor interface optional in ServerDBInfo and many other small
changes.
2019-02-14 16:37:17 -08:00
Evan Tschannen
171a69c810 Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Evan Tschannen
a4b2c9ef88 Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Jingyu Zhou
0e47912192 Fix an out-of-memory error 2019-02-14 16:37:16 -08:00
Jingyu Zhou
62c67a50e5 Fix segfault error
The usedIds is updated by master registration request, which populates the
usedIds map. However, this request may contain processes that cluster controller
is not aware, i.e., not in id_worker map.

This is ok until I added tracing the usedIds, which silently insert an empty
entry into id_worker map for the unknown process. This new entry can cause
crashing failure when trying to access its LocalityData.

Remove AsyncTrigger for usedIds, and change to serverInfo->onChange.

Use const & to avoid unnecessary copies in WorkerInterface's LocalityData
and getExtraTLogEligibleMachines().
2019-02-14 16:37:16 -08:00
Jingyu Zhou
21066b013a Remove DataDistributorRejoinRequest
This is no longer needed, since worker registration piggybacks distributor
interface now.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
816f8b1ae1 Per review comments
Add a knob for starting distributor delay.
Move distributor failed variable to a local loop.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
578473a974 Various review comments fixes 2019-02-14 16:37:16 -08:00
Jingyu Zhou
b3d1633114 Fix bugs of missing request
The quite database can fail to send out requests and report timeout. This seems
to be caused by reusing a request that uses the same ReplyPromise. Another bug
is Proxy can wait for unneeded time for a dabase change, while the distributor
is already known to itself.
2019-02-14 16:37:16 -08:00
Evan Tschannen
5fb48083cd Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Jingyu Zhou
8c61de318f Fix segfault and no_more_servers errors 2019-02-14 16:37:16 -08:00
Jingyu Zhou
7897616164 Fix wait failure bug on cluster controller
The setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because the AsyncVar::set triggers the other loop to run
first, which will wait on Never(). The correct code should wait on the Future
returned by the waitFailureClient.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
00f2253229 Piggyback data distributor interface in worker registration
This allows cluster controller to know data distributor during worker
registration phase, thus avoiding recruiting a new data distributor after
starting.

Also change the worker to skip creating a new data distributor if there is
already one running on the worker, which can trigger operation timeout in tests.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
39e4a59154 Add used worker IDs to cluster controller
This "usedIds" is updated when receiving a master registration message, so that
when recruiting new data distributor, existing assignment is known.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
6a655143e8 A follow-on fix for config key usage
And some trace event cleanups.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
be5c962bb7 Add a new configuration version key \xff/conf/version
This fixed a bug found by upgrade test, where the configuration monitor of the
data distributor was monitoring excludedServersVersionKey, which doesn't
change in ChangeConfig workload. As a result, data distributor was not aware of
configuration changes.

Adding this new key and make sure this key is updated in configuration changes
so that the monitor can detect configuration changes.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
3135f1d84b Cluster controller ignores distrobutor rejoin
After controller starts one, it will wait for that one and ignore any rejoins
received later.

Add remoteRecovered() to data distribution for remote team collection.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
e0a7162cf8 Add a failure timeout knob for data distributor.
Set default time to 1.0s.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
efd000dd11 Remove distributor interface from ClusterControllerData
This information is now kept in ServerDBInfo.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
3f7bbc68aa Remove getDistributorInterface from cluster controller 2019-02-14 16:37:16 -08:00
Jingyu Zhou
ef868f599c Add DataDistributorInterface to ServerDBInfo
Also change the Proxy and QuietDatabase to use the DataDistributorInterface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
0490160714 Fix according to Evan's comments
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.

TODO: move DataDistributorInterface into ServerDBInfo.
2019-02-14 16:30:13 -08:00
Evan Tschannen
1818aab205 Apply suggestions from code review
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:30:13 -08:00
Jingyu Zhou
886e7ab2ba Add a new DataDistributor role.
Let cluster controller to start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId

Add DataDistributor rejoin handling.

This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.

If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it failed.

The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for
the new DD.

Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
2019-02-14 16:30:13 -08:00
Vishesh Yadav
907446d0ce Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-14 11:37:38 -08:00
A.J. Beamon
b435d51061 Merge branch 'master' into track-server-request-latencies 2019-02-14 08:07:32 -08:00
Evan Tschannen
e9ddd94e27 The failure monitor is given a list of all IP addresses associated with a process
The connect packet includes the correct remote address
Did a lot of code cleanup
Simulation test mixed TLS and non-TLS listeners on the same process
2019-01-31 18:20:14 -08:00
Evan Tschannen
a678f778fa
Increase severity to SevWarnAlways for TooManyStatusRequests trace
Co-Authored-By: tclinken <trevorclinkenbeard@gmail.com>
2019-01-28 17:50:50 -08:00
Trevor Clinkenbeard
5b89db811a Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used. 2019-01-28 15:37:30 -08:00
Evan Tschannen
1d7fec3074 Merge commit '048bfc5c368063d9e009513078dab88be0cbd5b0' into task/tls-upgrade-2
# Conflicts:
#	.gitignore
2019-01-24 17:43:06 -08:00
A.J. Beamon
2198d24ce1 Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies
# Conflicts:
#	fdbclient/MasterProxyInterface.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/ServerDBInfo.h
#	fdbserver/Status.actor.cpp
#	fdbserver/fdbserver.vcxproj
#	fdbserver/storageserver.actor.cpp
2019-01-24 11:43:26 -08:00
A.J. Beamon
8e05e95045 Added the ability to configure the latency band settings by setting a special key in \xff keyspace. 2019-01-18 16:18:34 -08:00
Evan Tschannen
7dbf06162e
Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: bnamasivayam <36455962+bnamasivayam@users.noreply.github.com>
2019-01-14 16:57:00 -08:00
Balachandar Namasivayam
ff661bca22 Fix a minor bug in the RoleFitness Class. 2019-01-14 14:54:54 -08:00