576 Commits

Author SHA1 Message Date
Xiaoxi Wang
d25fc4db34 add ASSERT_WE_THINK 2022-04-07 09:21:50 -07:00
Xiaoxi Wang
20fee3dd06 check pseudo locality before pop 2022-04-05 23:48:18 -07:00
sfc-gh-tclinkenbeard
a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
A.J. Beamon
250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
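The ordering requirement described in this commit can be illustrated with a toy builder. This is only a sketch under stated assumptions: `ToyEvent`, its members, and the choice to throw on misuse are hypothetical and are not FoundationDB's `TraceEvent` implementation; the sketch shows just the rule that the suppression call must come before other chained calls.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy event builder (hypothetical; not FoundationDB's TraceEvent).
class ToyEvent {
public:
    explicit ToyEvent(std::string type) : type_(std::move(type)) {}

    // Suppression must be the first call in the chain: reject it once
    // any detail has already been attached.
    ToyEvent& suppressFor(double seconds) {
        if (!details_.empty())
            throw std::logic_error("suppressFor() must precede detail()");
        suppressSeconds_ = seconds;
        return *this;
    }

    // Attach a key/value detail and allow further chaining.
    ToyEvent& detail(std::string key, std::string value) {
        details_.emplace_back(std::move(key), std::move(value));
        return *this;
    }

    std::size_t detailCount() const { return details_.size(); }
    double suppressSeconds() const { return suppressSeconds_; }

private:
    std::string type_;
    double suppressSeconds_ = 0.0;
    std::vector<std::pair<std::string, std::string>> details_;
};
```

In this sketch, `ToyEvent("E").suppressFor(1.0).detail("K", "v")` is accepted, while calling `detail(...)` first and `suppressFor(...)` afterwards throws.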
Zhe Wu
e07ae6fdb9 Address comments 2022-02-16 15:28:56 -08:00
Zhe Wu
9da735c38e Batch empty peek reply 2022-02-16 15:28:56 -08:00
Ata E Husain Bohra
936bf5336a
Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine" (#6191)
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""

Major changes include:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9c9888e203421290959bd7f2c075d7f.
1.b. This reverts commit d174bb2e06bff01157d16c652073536c54d17f7f.
1.c. This reverts commit 30b05b469c87d9b526b427751c211fb5cf7ff9cd.

2. Update Status.actor to use the ClusterController interface to
   track recovery status.
3. Introduce a ServerKnob to define the "cluster recovery trace event"
   prefix. For now it remains "Master"; however, the knob should allow
   a smooth transition to the "Cluster" prefix, which seems more
   appropriate.
2022-01-06 12:15:51 -08:00
Aaron Molitor
30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff5dd66bdbbc5b984688ac3ebb15b901.
2021-12-24 11:25:51 -08:00
Aaron Molitor
d174bb2e06 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit abd2959702b0027ab23b8d42d8082b79c3b197f3.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra
abd2959702 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments

At present, the cluster recovery process consists of the following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible for recruiting all other processes as well as
   restoring the cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where the ClusterController recruits the
   "sequencer" process like other worker processes, compared to the
   current scheme where the "sequencer" process gets special
   treatment. In the newer scheme the sequencer is responsible for
   maintaining/providing the "committed version" (as expected).
2. ClusterController is responsible for worker process recruitment;
   in the current scheme the sequencer, though orchestrating the
   recovery state machine, needs to reach out to the
   ClusterController to recruit worker processes etc.

NOTE:
Patch has moved the recovery state machine code from the
'sequencer' to the 'cluster-controller' process; in addition,
necessary updates were made for both functionality and performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in the near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra
dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, the cluster recovery process consists of the following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible for recruiting all other processes as well as
   restoring the cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where the ClusterController recruits the
   "sequencer" process like other worker processes, compared to the
   current scheme where the "sequencer" process gets special
   treatment. In the newer scheme the sequencer is responsible for
   maintaining/providing the "committed version" (as expected).
2. ClusterController is responsible for worker process recruitment;
   in the current scheme the sequencer, though orchestrating the
   recovery state machine, needs to reach out to the
   ClusterController to recruit worker processes etc.

NOTE:
Patch has moved the recovery state machine code from the
'sequencer' to the 'cluster-controller' process; in addition,
necessary updates were made for both functionality and performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in the near future.
2021-12-22 14:06:27 -08:00
Evan Tschannen
e3819dad7c fix: If a removed tlog never attempted a queue commit, the update storage loop could get stuck waiting for queueCommittingVersion to advance 2021-11-25 09:55:01 -08:00
Evan Tschannen
964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Lukas Joswiak
e4c3f886da Fix recovery issue 2021-11-10 16:15:13 -08:00
Lukas Joswiak
15e0d5b29f Add explicit transaction options when reading cluster ID 2021-11-09 12:29:49 -08:00
Lukas Joswiak
74cf64fe0f Sync cluster ID through ServerDBInfo 2021-11-09 12:29:48 -08:00
Lukas Joswiak
4640045243 Fix rare simulation failures
When partitions appear before a cluster has fully recovered, it was
possible to have different tlogs persist different cluster IDs because
they were involved in different partitions. This would affect recovery
when a quorum was eventually reached. The solution to this is to avoid
persisting the cluster ID before a cluster has fully recovered, to make
sure all nodes agree on the cluster ID.
2021-11-09 12:29:48 -08:00
Lukas Joswiak
3988b11fd6 Cleanup 2021-11-09 12:29:48 -08:00
Lukas Joswiak
aa3383f0e3 Exclude when joining new cluster 2021-11-09 12:29:48 -08:00
Lukas Joswiak
3e2c65bb11 Allow tlog to join another cluster but retain its data 2021-11-09 12:29:48 -08:00
Lukas Joswiak
30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
Markus Pilman
7df059570a Make sure unit tests are run often enough 2021-11-08 15:43:32 -07:00
Evan Tschannen
c615279807
Merge pull request #5720 from sfc-gh-ljoswiak/fixes/recovery-failure-fix
Fix possible recovery hang
2021-10-25 12:35:31 -07:00
Evan Tschannen
f1158371a7 Merge branch 'master' of https://github.com/apple/foundationdb into feature-range-feed
# Conflicts:
#	flow/error_definitions.h
2021-10-21 00:55:12 -07:00
Lukas Joswiak
120d99e941 Fix a recovery hang that could occur when a new recovery was started during the existing recovery 2021-10-19 17:37:14 -07:00
sfc-gh-tclinkenbeard
9e06b6e6e3 Make IClosable interface const-correct 2021-10-18 13:40:47 -07:00
Evan Tschannen
5c642f706e Merge branch 'master' of https://github.com/apple/foundationdb into feature-range-feed
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
2021-10-09 19:34:16 -07:00
Xiaoge Su
abf73047ca Enforce std:: specifier rather than using namespace 2021-09-16 19:40:28 -07:00
Xiaoge Su
067c1cc55b Extract methods in LogSystem.h to corresponding cpp file 2021-09-12 14:17:19 -07:00
Evan Tschannen
ac5b580e2d Merge branch 'master' into feature-range-feed
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/StorageServerInterface.cpp
#	fdbclient/StorageServerInterface.h
#	fdbserver/ApplyMetadataMutation.cpp
#	fdbserver/TLogServer.actor.cpp
#	flow/error_definitions.h
2021-09-09 23:13:22 -07:00
Steve Atherton
deeb6b3404 Merge branch 'master' of https://github.com/apple/foundationdb into durability-bug-repro1
# Conflicts:
#	fdbserver/TLogServer.actor.cpp
2021-08-24 16:19:16 -07:00
Steve Atherton
ec0e39b40f Bug fix: Popped versions are exclusive, so after recovery a tag for which there is no longer data should be considered popped up until the version *after* recovery, indicating that data at the recovery version itself has been popped. 2021-08-24 15:16:20 -07:00
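The exclusive-bound reasoning in this bug fix can be checked with a short sketch. The names `Version`, `isPopped`, and `poppedForEmptyTag` are hypothetical and not taken from the FoundationDB source, and the sketch assumes that "the version after recovery" means `recoveryVersion + 1`.

```cpp
#include <cassert>
#include <cstdint>

using Version = std::int64_t; // assumption: versions modeled as 64-bit integers

// Hypothetical helper: with an exclusive popped version, data at
// dataVersion has been popped only if it lies strictly below popped.
constexpr bool isPopped(Version dataVersion, Version popped) {
    return dataVersion < popped;
}

// Hypothetical helper: popped value for a tag with no remaining data after
// recovery. Assumes "the version after recovery" means recoveryVersion + 1.
constexpr Version poppedForEmptyTag(Version recoveryVersion) {
    return recoveryVersion + 1;
}

// Small self-check of the two cases discussed in the commit message.
inline void checkPoppedSemantics() {
    const Version recoveryVersion = 100;
    // Popping only up to the recovery version leaves data at that version unpopped.
    assert(!isPopped(recoveryVersion, recoveryVersion));
    // Popping up to the version after recovery covers the recovery version itself.
    assert(isPopped(recoveryVersion, poppedForEmptyTag(recoveryVersion)));
}
```

Under these assumptions, recording the recovery version itself as the popped version would leave data at exactly that version looking unpopped, which is the off-by-one the commit message describes.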
Xiaoxi Wang
a97570bd06 solve mis-spelling, trace log and format problems 2021-08-11 18:26:00 -07:00
Xiaoxi Wang
1f6cee89ab merge master, fix conflicts 2021-08-10 10:01:45 -07:00
Steve Atherton
c73e861074 Move role UIDs for MutationTracking TraceEvents from various inconsistent detail fields into the TraceEvent UID field. 2021-08-10 01:59:28 -07:00
Steve Atherton
54c7036eaf Move role UIDs for MutationTracking TraceEvents from various inconsistent detail fields into the TraceEvent UID field. 2021-08-10 01:52:36 -07:00
Evan Tschannen
208a5790ad fixed usage of durable version 2021-08-09 21:58:44 -07:00
Evan Tschannen
ed28aecde0 Merge branch 'master' into feature-range-feed 2021-08-09 20:40:55 -07:00
Evan Tschannen
bc9a0e1315 first attempt to add data distribution support for range feeds 2021-08-09 10:05:56 -07:00
Xiaoxi Wang
2263626cdc 200k test clean: enable remote Log pull from LogRouter 2021-08-07 09:53:32 -07:00
Xiaoxi Wang
2df0474fec merge master 2021-08-02 11:58:35 -07:00
Xiaoxi Wang
ae2268f9f2 200k simulation: check stream sequence; delay in GetMore loop 2021-08-02 10:52:24 -07:00
Xiaoxi Wang
2a88033800 clean 100k simulation test. revert changes of fdbrpc.h 2021-07-31 16:46:14 -07:00
Xiaoxi Wang
1c4bce17aa revert code refactor 2021-07-30 19:08:22 -07:00
Xiaoxi Wang
10c82b422f merge master branch 2021-07-28 14:19:46 -07:00
Xiaoxi Wang
12d4f5c261 disable streaming peek for localities < 0 2021-07-28 14:11:25 -07:00
sfc-gh-tclinkenbeard
c74047c665 Merge remote-tracking branch 'origin/master' into fix-more-clang-warnings 2021-07-28 11:51:02 -07:00
Steve Atherton
507c1f11e3 Add .log() to bare TraceEvent() invocations without any .detail()s to avoid clang-tidy warning about immediate destruction of object without use. 2021-07-26 19:55:10 -07:00
Xiaoxi Wang
c6b0de1264 problem: OOM 2021-07-26 09:36:53 -07:00
sfc-gh-tclinkenbeard
23558a5430 Fix -Wreorder-ctor warnings in TLogServer.actor.cpp 2021-07-24 23:15:22 -07:00