249 Commits

Author SHA1 Message Date
Jingyu Zhou
ef868f599c Add DataDistributorInterface to ServerDBInfo
Also change the Proxy and QuietDatabase to use the DataDistributorInterface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
0490160714 Fix according to Evan's comments
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.

TODO: move DataDistributorInterface into ServerDBInfo.
2019-02-14 16:30:13 -08:00
Evan Tschannen
1818aab205 Apply suggestions from code review
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:30:13 -08:00
Jingyu Zhou
886e7ab2ba Add a new DataDistributor role.
Let cluster controller to start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId

Add DataDistributor rejoin handling.

This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.

If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it failed.

The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for
the new DD.

Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
2019-02-14 16:30:13 -08:00
Vishesh Yadav
907446d0ce Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-14 11:37:38 -08:00
A.J. Beamon
b435d51061 Merge branch 'master' into track-server-request-latencies 2019-02-14 08:07:32 -08:00
Evan Tschannen
e9ddd94e27 The failure monitor is given a list of all IP addresses associated with a process
The connect packet includes the correct remote address
Did a lot of code cleanup
Simulation test mixed TLS and non-TLS listeners on the same process
2019-01-31 18:20:14 -08:00
Evan Tschannen
a678f778fa
Increase severity to SevWarnAlways for TooManyStatusRequests trace
Co-Authored-By: tclinken <trevorclinkenbeard@gmail.com>
2019-01-28 17:50:50 -08:00
Trevor Clinkenbeard
5b89db811a Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used. 2019-01-28 15:37:30 -08:00
Evan Tschannen
1d7fec3074 Merge commit '048bfc5c368063d9e009513078dab88be0cbd5b0' into task/tls-upgrade-2
# Conflicts:
#	.gitignore
2019-01-24 17:43:06 -08:00
A.J. Beamon
2198d24ce1 Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies
# Conflicts:
#	fdbclient/MasterProxyInterface.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/ServerDBInfo.h
#	fdbserver/Status.actor.cpp
#	fdbserver/fdbserver.vcxproj
#	fdbserver/storageserver.actor.cpp
2019-01-24 11:43:26 -08:00
A.J. Beamon
8e05e95045 Added the ability to configure the latency band settings by setting a special key in \xff keyspace. 2019-01-18 16:18:34 -08:00
Evan Tschannen
7dbf06162e
Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: bnamasivayam <36455962+bnamasivayam@users.noreply.github.com>
2019-01-14 16:57:00 -08:00
Balachandar Namasivayam
ff661bca22 Fix a minor bug in the RoleFitness Class. 2019-01-14 14:54:54 -08:00
Balachandar Namasivayam
a8e2e75cd5 Re-enable CheckDesiredClasses after making necessary changes for multi-region setup.
Fixed a couple of bugs
1) A rare race condition where a worker is being roles even after it died.
2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.
2019-01-10 10:28:32 -08:00
Vishesh Yadav
3eb9b23024 Listen to multiple addresses and start using vector<NetworkAdddress> in Endpoint
- This patch will make FDB listen to multiple addresses given via
  command line. Although, we'll still use first address in most places,
  this patch starts using vector<NetworkAddress> in Endpoint at some basic
  places.
- When sending packets to an endpoint, pick a random network address in
  endpoints
- Renames Endpoint::address to Endpoint::addresses since it
  now holds a vector of addresses.
2018-12-13 13:36:52 -08:00
Vishesh Yadav
43e5a46f9b Change Endpoint::address(NetworkAddress) to vector<NetworkAddress>
Extend `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll
have multiple IP:PORT pairs.

This patch simply adds the field and makes changes to compile the
codebase. The first element of of `address` field is used everywhere.
Hence the way we talk to remains same with this patch.

NOTE:

Directly accessing the first memeber of Endpoint::address is unsafe
as Endpoint() doesn't enforces non-empty address list. However, since
the correctness test pass for now and are anyway replacing all those
unsafe accesses with ones considering the whole vector, this patch
ignores to access them in safe way.
2018-12-13 13:36:52 -08:00
Evan Tschannen
4b5d0b4e2c Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/AsyncFileBlobStore.actor.cpp
#	fdbclient/AsyncFileBlobStore.actor.h
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/BlobStore.h
#	fdbclient/HTTP.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/batcher.actor.h
#	fdbrpc/fdbrpc.vcxproj
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
04fa2a7202 fix: we could recover in a region with priority < 0 2018-11-05 10:14:26 -08:00
Evan Tschannen
87295cc263 suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created 2018-11-04 23:07:56 -08:00
Evan Tschannen
c1bd279a4e addressed review comments 2018-11-04 20:26:23 -08:00
Evan Tschannen
accba4fa1d keep track of the last time a process became available to set a better starting value for remoteStartTime 2018-11-04 14:33:03 -08:00
Evan Tschannen
30fbc29af1 Renamed TimeKeeperStarted to TimeKeeperCommit 2018-11-02 12:57:03 -07:00
Evan Tschannen
278dbd5096 call debug transaction on timekeeper 2018-11-02 12:56:29 -07:00
Robert Escriva
268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
3922e477a5 Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/LogSystemDiskQueueAdapter.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
3fdf72c626 fix: we need to force recovery if the master is still attempting to read the txs tag 2018-09-28 13:33:33 -07:00
Evan Tschannen
22e6afbb18 fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing 2018-09-28 12:12:06 -07:00
A.J. Beamon
92990d6aef Merge release-6.0 into master 2018-09-21 16:14:39 -07:00
Evan Tschannen
6b6d7a087d The cluster controller should never consider itself as failed (that will be handled by the coordinators)
Simplified the check that the cluster controller is excluded
2018-09-20 17:01:11 -07:00
Evan Tschannen
03728db99b do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved 2018-09-19 18:28:24 -07:00
Evan Tschannen
4dd2dda0a3 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/worker.actor.cpp
2018-09-05 16:11:06 -07:00
Evan Tschannen
df406a340e
Merge pull request #742 from ajbeamon/roles-in-trace-events
Add the roles running on a process as a field on trace events in the …
2018-09-05 16:08:12 -07:00
Evan Tschannen
90301f497f Merge branch 'release-6.0'
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbrpc/TLSConnection.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/Status.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/StatusWorkload.actor.cpp
#	versions.target
2018-09-05 16:06:33 -07:00
A.J. Beamon
2de0b5d6d7 Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations. 2018-09-05 15:06:14 -07:00
Evan Tschannen
4eaff42e4f
Merge pull request #712 from ajbeamon/remove-database-name-internal
Eliminate use of database names (phase 1)
2018-09-05 10:35:00 -07:00
Evan Tschannen
e60c668853 The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries 2018-08-31 10:51:55 -07:00
Evan Tschannen
a9987202d6 fixed merge problem 2018-08-22 08:47:47 -07:00
Evan Tschannen
717c43a69f merge 6.0 into master 2018-08-22 00:28:04 -07:00
A.J. Beamon
2a97139d5d This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated. 2018-08-16 10:24:12 -07:00
Evan Tschannen
e770629229 fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task. 2018-08-15 19:39:06 -07:00
Alex Miller
86dbe1f0e9 Fix more instances of actorcompiler.h being in the wrong place. 2018-08-14 15:50:26 -07:00
Alex Miller
fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5 Rewrite all Void _ = wait(...) -> wait(...).
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorompiled source through clang.
2018-08-14 15:50:26 -07:00
A.J. Beamon
574c5576a2 Merge branch 'release-6.0' of github.com:apple/foundationdb
# Conflicts:
#	fdbrpc/TLSConnection.actor.cpp
#	versions.target
2018-08-10 14:31:58 -07:00
A.J. Beamon
3535ddad80
Merge pull request #674 from alexmiller-apple/glibcxx-debug-fixes
Fix bugs uncovered by -D_GLIBCXX_DEBUG
2018-08-09 08:18:51 -07:00
Evan Tschannen
be1a4d74c7 tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Alex Miller
1a7cda4149 Stop performing self-moves. (e.g. a = std::move(a))
self-moves are frowned upon in C++, and in our code this generally happens from
calls to swap as part of trying to implement a "unordered erase" function via
swap-to-the-end-and-pop_back.  For convenience, a swapAndPop() function is now
offered that performs this, while disallowing self-moves.
2018-08-01 18:09:54 -07:00
Evan Tschannen
1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen
30b2f85020 fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you will no remaining logs 2018-07-14 16:26:45 -07:00