93 Commits

Author SHA1 Message Date
Meng Xu
cf935ff9e6 Remove debug message and format code 2019-07-11 22:05:20 -07:00
Meng Xu
cd28a0b604 Reenable check each server must have at least 1 team 2019-07-11 17:58:14 -07:00
Meng Xu
221e6945db TeamTracker:Fix bug in counting optimalTeamCount
When a teamTracker is cancelled, e.g., by the redundant teamRemover or badTeamRemover,
we should decrease the optimalTeamCount if the team is considered as an
optimal team, i.e., all members' machine fitness is no worse than unset, and
the team is healthy.
2019-07-11 17:22:41 -07:00
Meng Xu
4c32593f59 QuietDB:Do not check when machineId is not zoneID 2019-07-11 10:37:16 -07:00
Meng Xu
4fae510633 AddBestMachineTeams:BugFix:Must build team when it has remainingMachineTeamBudget 2019-07-10 11:55:06 -07:00
Meng Xu
9816fb6aca ConsistencyCheck:Check minServerTeamOnServer larger than 0 2019-07-10 11:53:47 -07:00
Meng Xu
522230f050 ConsistencyCheck:getTeamCollectionValid tries 10 times before return false
Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
2019-07-09 11:46:57 -07:00
Meng Xu
cf03b274a2 TeamTracker:Add traceTeamCollectionInfo 2019-07-08 23:01:25 -07:00
Meng Xu
08d76a7bbe ServerTeamRemover:Bug fix and clang-format 2019-07-08 17:08:32 -07:00
Meng Xu
9cc11e88c5 TeamBuilder:Reduce unnecessary calculation of remainingTeamBudget 2019-07-08 16:56:06 -07:00
Meng Xu
08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
A.J. Beamon
2a56e011ea Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
2019-07-05 13:52:29 -07:00
Meng Xu
599fcb2e6d Add serverTeamRemover to remove redundant server teams 2019-07-02 17:40:37 -07:00
Meng Xu
716494ed9f ConsistencyCheck:Check serverTeamNumber larger than desired number 2019-07-02 17:40:37 -07:00
Meng Xu
875cb877ac TeamCollection: Apply clang-format 2019-06-28 16:01:05 -07:00
Meng Xu
0baae134f6 TeamCollectionInfo: Resolve review comments 2019-06-28 15:59:47 -07:00
Meng Xu
4da345f7d2 TeamCollectionTest:Remove test on minTeamOnServer 2019-06-27 19:05:10 -07:00
Meng Xu
f889843332 Change traceTeamCollectionInfo to actor
There are cases where traceTeamCollectionInfo was called within the same execution block, i.e.,
no wait between the two traceTeamCollectionInfo calls.
Because simulation uses the same time for all execution instructions in the same execution block,
having more than one traceTeamCollectionInfo at the same time will mess up the trackLatest semantics.
When the simulator always chooses one of them, the simulation test reports a false-positive error.

Changing this function to actor and adding a small delay inside the function can solve this problem.
2019-06-27 18:24:20 -07:00
Meng Xu
4fe3c7f749 TeamCollectionInfo:Revert to original version where it is 2019-06-27 17:09:21 -07:00
Meng Xu
42620e4831 TeamCollectionTest:GetTeamCollectionValid wait until values are correct 2019-06-27 16:52:36 -07:00
Meng Xu
8d5e848808 QuietDatabase test: Check each server has at least 1 team 2019-06-27 14:22:41 -07:00
Meng Xu
53324e4db7 TeamCollectionInfo: clang format 2019-06-27 11:27:29 -07:00
Meng Xu
cc6a0e9bcd TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0
In ConfigureTest, one server may be left with 0 server teams, even if
we call buildTeams in the storageServerTracker.
2019-06-27 11:27:29 -07:00
Meng Xu
02cdcc0b0c TeamCollectionTest: Only ensure each server and machine have a team 2019-06-27 11:27:29 -07:00
Meng Xu
21664742a6 TeamCollection:Desired team number may be larger than the max possible team number
For example, with 3 servers and a replication factor of 3, we can have only 1 team,
but the desired team number is 3 * 5 = 15.

Instead of sanity checking the absolute team number per server, we check
the difference between the minServerTeamOnServer and maxServerTeamOnServer.
2019-06-27 11:15:06 -07:00
Meng Xu
08f28e99f9 TeamCollection:Test no server or machine has incorrect team number
Add a check to the simulation test that makes sure the number of server teams
per server is no less than the desired_teams_per_server defined
in knobs and no larger than max_teams_per_server.

Add a similar check for the number of machine teams per machine.
2019-06-27 11:15:06 -07:00
A.J. Beamon
f417e60264 Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation
# Conflicts:
#	fdbserver/QuietDatabase.actor.cpp
2019-05-23 09:52:00 -07:00
A.J. Beamon
d29c7e4c9b Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/QuietDatabase.actor.cpp
#	versions.target
2019-05-23 09:28:45 -07:00
Evan Tschannen
f4b18f2c4f fixed whitespace 2019-05-21 11:31:34 -07:00
Evan Tschannen
23091a7d96 fixed review comments 2019-05-21 10:53:36 -07:00
Evan Tschannen
4059d68348 fix: the tlog would not pop data from the disk queue after a storage server was removed, because the tag still exists in memory on the logs
fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData
the quiet database check now ensures that tlogs have no more than 30 seconds of versions unpopped from the disk queue
2019-05-20 23:58:45 -07:00
A.J. Beamon
5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
Austin Seipp
bf378952cb fdbserver: fix some print/scan format warnings
Signed-off-by: Austin Seipp <aseipp@pobox.com>
2019-05-06 13:35:29 -07:00
Evan Tschannen
710a64dc4e replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails 2019-03-08 11:25:07 -05:00
Evan Tschannen
d008de576e
Merge pull request #1139 from xumengpanda/mengxu/machine-team-upgrade-PR
Add background actor to remove redundant teams
2019-02-22 14:22:07 -08:00
mpilman
999ea09bfd Use correct fwd decls in TesterInterface
Also TesterInterface.h -> TesterInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
3f0fd2a20c Use fwd decls in WorkerInterface
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
0bb60e5a3b Use proper fwd decl in NativeAPI
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
mpilman
3cb2391b58 use proper fwd declarations in ManagementAPI
Also ManagementAPI.h -> ManagementAPI.actor.h
2019-02-19 15:16:59 -08:00
Meng Xu
ed1d4635bc TeamRemover: Format cleaning
Use clang-format and remove debug messages for the code
that fixes bugs in merging the PR of adding a
DataDistributor role
2019-02-19 08:13:10 -08:00
Meng Xu
b35631365f TeamRemover: Resolve conflict when merging with PR 1061
The previous commit merged with master, which had just merged
pull request #1062 from jzhou77/PR, which adds a new DataDistribution role.

The merge causes conflicts and errors in simulation tests.

This commit resolves the code conflicts and
tries to fix the new errors after incorporating the new DataDistribution role.
2019-02-19 08:13:10 -08:00
Meng Xu
6d09ac483c Merge with master 2019-02-15 17:03:40 -08:00
Jingyu Zhou
5e6577cc82 Final cleanup per review comments
Make distributor interface optional in ServerDBInfo and many other small
changes.
2019-02-14 16:37:17 -08:00
Jingyu Zhou
07dab56133 Fix a data movement stuck bug
When moving keys to a team, if one of the servers in the target team dies, then
the move can become stuck. This is because the DDTeamCollection waits for all
the data movement of the failed server to be completed. However, in this case,
because the movement has not finished yet, checking the database tells us there
is no key associated with this server and it is safe to go ahead. In reality,
only the in-memory structure knows there is pending movement, i.e., unfinished
move causes some keys to be attributed to the failed server. Thus, the server
can't be removed yet. Fix by adding a check with in-memory structure in
waitForAllDataRemoved().

Use const& to optimize a few function parameters.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
b3d1633114 Fix bugs of missing request
The quiet database can fail to send out requests and report a timeout. This seems
to be caused by reusing a request that uses the same ReplyPromise. Another bug
is that the Proxy can wait unnecessarily for a database change, while the distributor
is already known to itself.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
3135f1d84b Cluster controller ignores distributor rejoin
After the controller starts one, it will wait for that one and ignore any rejoins
received later.

Add remoteRecovered() to data distribution for remote team collection.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
ef868f599c Add DataDistributorInterface to ServerDBInfo
Also change the Proxy and QuietDatabase to use the DataDistributorInterface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou
0490160714 Fix according to Evan's comments
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.

TODO: move DataDistributorInterface into ServerDBInfo.
2019-02-14 16:30:13 -08:00
Jingyu Zhou
886e7ab2ba Add a new DataDistributor role.
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId

Add DataDistributor rejoin handling.

This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.

If DataDistributor (DD) doesn't rejoin within a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it fails.

The Proxy also monitors the DD. If the DD fails, the Proxy will ask CC for
the new DD.

Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
2019-02-14 16:30:13 -08:00
Andrew Noyes
067a445e06 Replace unused _ variables with wait(success(...)) 2019-02-12 17:30:30 -08:00