200 Commits

Author SHA1 Message Date
Evan Tschannen
220ede3564 fixed compile error 2020-07-20 11:35:20 -07:00
Evan Tschannen
be67e9cfc7 wait for the correct cluster controller interface before starting master recovery 2020-07-20 11:29:37 -07:00
Evan Tschannen
32c0169fc8 use the old logic for lifetime since we already have verified the cluster controller is correct 2020-07-20 10:26:47 -07:00
Evan Tschannen
6a38f81269 do not kill the master unless we have a dbInfo from the current cluster controller 2020-07-17 14:59:38 -07:00
Jingyu Zhou
5cc5d9cf1e Log peer address whose failure can cause master recovery
So when there is master recovery due to failed tlog, proxy, resolver, log
router, or resolver, we can have a trace event tells which address that the
master thinks is dead.
2020-07-10 15:57:03 -07:00
Jingyu Zhou
d883426c6a Fix spammy GotBackupProgress events
Only print this types of events during master recovery and don't log them for
backup workers.
2020-06-27 21:30:38 -07:00
Evan Tschannen
ced65cd30b finished explicitly versioning everything stored in the database 2020-05-22 17:14:21 -07:00
Markus Pilman
c2bc75516f Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles 2020-05-14 10:34:53 -07:00
Markus Pilman
5f9b127e56 Emit traces regularly about role assignment
We are currently emitting Role transition traces when a role starts and
when it ends. While this is useful for debugging, it doesn't work well
with tools that inject data and might potentially miss some trace lines.

We do decorate each trace lines with the roles assigned to that
particular process, however, this is not sufficient for tools that can
make use of the UID -> Role mapping
2020-05-08 16:27:57 -07:00
Evan Tschannen
a442565e13 more work towards shrinking locality 2020-04-18 21:29:38 -07:00
Evan Tschannen
07cc0a8d74 code cleanup 2020-04-10 17:02:11 -07:00
Evan Tschannen
a51c92854a Merge branch 'master' into feature-tree-broadcast
# Conflicts:
#	fdbserver/WorkerInterface.actor.h
#	fdbserver/worker.actor.cpp
2020-04-06 21:09:44 -07:00
Evan Tschannen
2a1bd97120 fix compilation errors 2020-04-06 20:58:43 -07:00
Evan Tschannen
477d66b46d implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers 2020-04-05 23:09:36 -07:00
Evan Tschannen
bb5799bd20
Merge pull request #2642 from xumengpanda/mengxu/new-backup-format-PR
FastRestore:Integrate with new backup format
2020-03-25 15:47:55 -07:00
Jingyu Zhou
e2f317a0da Fix a crash failure 2020-03-25 09:18:49 -07:00
Jingyu Zhou
243d078596 Fix off by one error
Epoch end version is saved version + 1, so need +1 for minBackupVersion.
2020-03-23 20:44:31 -07:00
Jingyu Zhou
90b40e1d75 Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak
Resolve Conflicts:
	fdbclient/BackupAgent.actor.h
	fdbserver/BackupWorker.actor.cpp
	fdbserver/RestoreMaster.actor.cpp
	fdbserver/masterserver.actor.cpp
2020-03-23 13:35:33 -07:00
Meng Xu
be67ab4d6a Correct comment based on review 2020-03-23 12:53:40 -07:00
Andrew Noyes
fa8eaf9810 Assert recoverAndEndEpoch does not become ready 2020-03-23 12:40:00 -07:00
Meng Xu
3f31ebf659 New backup:Revise event name and explain code 2020-03-23 10:55:44 -07:00
Jingyu Zhou
97702d91c8 Skip recruiting backup workers for older epochs before min backup version
When master starts recruiting backup workers, if there is no active backup job
or the min version of the backup job is greater than old epoch's end version,
then these old epochs can be skipped.
2020-03-21 13:44:02 -07:00
Jingyu Zhou
818072f3cb Set oldest backup epoch if not recruiting backup workers
Since tlog is not kept until backup worker has pulled mutations from it, the
old tlogs can only be displaced after oldest backup epoch equals current epoch.
So if master is not recruiting backup workers, it should set the oldest backup
epoch as the current epoch.
2020-03-20 20:16:43 -07:00
Jingyu Zhou
5359528132 Reduce a call to getLogSystemConfig() 2020-03-20 20:15:09 -07:00
Jingyu Zhou
12ed8ad536 Fix backup worker start version when logset start version is lower
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
2020-03-20 20:15:08 -07:00
Jingyu Zhou
80d3fa1222 Add delay for master to recruit backup workers
This delay is to ensure old epoch's backup workers can save their progress in
the database. Otherwise, the new master could attempts to recruit backup
workers for the old epoch on version ranges that have already been popped. As
a result, the logs will lose data.
2020-03-20 20:15:08 -07:00
Jingyu Zhou
fda6c08640 Include a total number of tags in partition log file names
This is needed for BackupContainer to check partitioned mutation logs are
continuous, i.e., restorable to a version.
2020-03-20 20:13:38 -07:00
Jingyu Zhou
5bf62c8f85 Reduce a call to getLogSystemConfig() 2020-03-19 10:08:19 -07:00
Jingyu Zhou
89d8f13038 Fix backup worker start version when logset start version is lower
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
2020-03-18 16:41:35 -07:00
Jingyu Zhou
15437ffb53 Add delay for master to recruit backup workers
This delay is to ensure old epoch's backup workers can save their progress in
the database. Otherwise, the new master could attempts to recruit backup
workers for the old epoch on version ranges that have already been popped. As
a result, the logs will lose data.
2020-03-18 16:41:35 -07:00
Jingyu Zhou
d8c6bf585d Include a total number of tags in partition log file names
This is needed for BackupContainer to check partitioned mutation logs are
continuous, i.e., restorable to a version.
2020-03-18 16:39:40 -07:00
Evan Tschannen
e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen
56dee89e6e active generations should include the current one 2020-03-16 11:09:42 -07:00
Evan Tschannen
e5d53c863b report in status the number of active generations 2020-03-16 10:29:17 -07:00
Evan Tschannen
818537ed2d
Update fdbserver/masterserver.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-14 15:04:46 -07:00
Evan Tschannen
2f2f56020f
Update fdbserver/masterserver.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-13 15:54:13 -07:00
Evan Tschannen
a39effa57d delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state 2020-03-13 10:28:32 -07:00
Evan Tschannen
96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
A.J. Beamon
df2b0452b4 Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef. 2020-02-06 13:19:24 -08:00
Jingyu Zhou
52c6737411 Rename backupLoggingEnabled as backupWorkerEnabled
To highlight the changes for 7.0 backup changes. By default,
backup_worker_enabled flag is set for 7.0 version.
2020-02-04 10:09:16 -08:00
Jingyu Zhou
0db03f1d3c Use backup_logging_enabled flag
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Jingyu Zhou
38aa1903fd Add a DB configuration option for backup workers
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.

The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is not enabled, starting a backup will wait on all backup workers
to be started.
2020-01-31 19:29:09 -08:00
Jingyu Zhou
8b67a89eed More review comments fixed. 2020-01-22 19:42:13 -08:00
Jingyu Zhou
1eaea91cb3 Address review comments 2020-01-22 19:42:13 -08:00
Jingyu Zhou
e14246ac16 Add more information for trace events 2020-01-22 19:42:13 -08:00
Jingyu Zhou
4bed33031f Set backup worker start version to be savedVersion + 1
If no progress found, start version is set to epochBegin. So the start version
is the one after the last saved (or from last epoch's saved) version.
2020-01-22 19:42:13 -08:00
Jingyu Zhou
4ed75e37f3 BackupProgress uses old epoch's begin version if no progress found
Get rid of the complex logic of choosing the largest saved version from
previous epoch for the oldest epoch. Instead, use the begin version now
available from log system.
2020-01-22 19:38:46 -08:00
Jingyu Zhou
19eacac3ce Add a unit test for BackupProgress 2020-01-22 19:38:46 -08:00
Jingyu Zhou
64052f6349 Check and fill backup gaps for old epochs and tags
Sometimes the backup worker has not updated progress to the system space and a
master recovery happens. As a result, next epoch doesn't know the progress of
previous ones. This change is to check for such missing gaps and fill them with
the whole range [startVersion, endVersion).

The code is refactored into BackupProgress.actor.* to consolidate backup
progress processing for the master server.
2020-01-22 19:38:46 -08:00
Jingyu Zhou
ed54aaa09e Fix a crash failure of empty backup interface 2020-01-22 19:38:46 -08:00