60 Commits

Author SHA1 Message Date
Meng Xu
a08a6776f5 FastRestore: Refactor to smaller components
The current code uses one restore interface to handle the work
for all restore roles, i.e., master, loader and applier.
This makes the code harder to review, maintain, and scale.

This commit splits the restore into multiple roles by mimicking the FDB
transaction system:
1) It uses a RestoreWorker as the process that hosts restore roles.
   This commit assumes one restore role per RestoreWorker, but
   it should be easy to extend it to support multiple roles per RestoreWorker;
2) It creates 3 restore roles:
   RestoreMaster: Coordinate the restore process and send commands to the other two roles;
   RestoreLoader: Parse backup files into mutations and send mutations to appliers;
   RestoreApplier: Sort received mutations and apply them to DB in order.

Compilable version. Still to be tested for correctness.
2019-05-10 14:20:06 -07:00
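The role split described in commit a08a6776f5 can be pictured with a small Python sketch. It is not the FoundationDB code: the class names only mirror the role names in the message, and the `Mutation` fields, the "version key value" line format, and the in-memory `db` dictionary are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    version: int  # assumed ordering field; the commit only says "in order"
    key: str
    value: str


class RestoreLoader:
    """Parses backup data into mutations (hypothetical line format)."""

    def parse(self, backup_lines):
        mutations = []
        for line in backup_lines:
            version, key, value = line.split()
            mutations.append(Mutation(int(version), key, value))
        return mutations


class RestoreApplier:
    """Buffers received mutations, then sorts and applies them."""

    def __init__(self):
        self.received = []

    def receive(self, mutations):
        self.received.extend(mutations)

    def apply(self, db):
        # sorted() is stable, so equal versions keep their arrival order.
        for m in sorted(self.received, key=lambda m: m.version):
            db[m.key] = m.value
        self.received.clear()


class RestoreMaster:
    """Coordinates the other two roles."""

    def __init__(self, loader, applier):
        self.loader = loader
        self.applier = applier

    def run(self, backup_files, db):
        for backup_lines in backup_files:
            self.applier.receive(self.loader.parse(backup_lines))
        self.applier.apply(db)
        return db


def demo_restore():
    files = [["30 a third", "10 a first"], ["20 a second", "15 b only"]]
    return RestoreMaster(RestoreLoader(), RestoreApplier()).run(files, {})
```

Because the applier sorts before applying, the final value of key `a` comes from the highest version even though that mutation arrived first.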
Meng Xu
529ce66b6c Merge branch 'apple/master' into mengxu/performant-restore-PR 2019-04-18 18:02:45 -07:00
Meng Xu
092a890da5 FastRestore: Fix MacOS compilation
The bug exposed by the macOS compilation may also cause logic errors
in the implementation, even on Linux.
2019-04-09 22:37:24 -07:00
mpilman
1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
Meng Xu
c4a8a80d6f Merge branch 'apple/master' into mengxu/performant-restore-PR 2019-04-04 22:51:00 -07:00
Meng Xu
eb1e880fef FastRestore: Rename RestoreCommandInterface
Rename it to RestoreInterface.
The new name is more general because we will have different types of
RequestStreams for each type of command.
2019-04-04 13:52:24 -07:00
Evan Tschannen
781cf9b5a0 added the ability to make a zoneId for maintenance in fdbcli 2019-04-01 17:55:13 -07:00
Meng Xu
d68c9ec09e FastRestore: Fix after merge with master 2019-03-31 22:07:37 -07:00
Meng Xu
70d7c289f4 Merge branch 'master' into mengxu/restore/parallel-v7 2019-03-30 22:13:10 -07:00
Evan Tschannen
b6008558d3 renamed BinaryWriter.toStringRef() to .toValue(), because the function now returns a Standalone<StringRef>()
eliminated an unnecessary copy from the proxy commit path
eliminated an unnecessary copy from buffered peek cursor
2019-03-28 11:52:50 -07:00
Meng Xu
ee70bbf318 FastRestore: Correct running after refactor
Tested on one test case; it passed.
2019-03-14 16:45:04 -07:00
Evan Tschannen
2627bcd35e Merge branch 'master' into feature-metadata-version 2019-03-10 21:13:28 -07:00
Meng Xu
00d1e5e70a FastRestore: Add command UID and code clean
Change variable name to a shorter name
Remove most unused code
Compilable at this commit
2019-03-10 17:17:18 -07:00
Vishesh Yadav
41d18db7b9 fix: update the encoding of AddressExclusion in SystemData #963 2019-03-04 14:12:45 -08:00
Vishesh Yadav
57832e625d net: Support IPv6 #963
- NetworkAddress now contains an IPAddress object, which can be either an
IPv4 or an IPv6 address. 128 bits are used even for IPv4 addresses;
however, only 32 bits are used when using or serializing an IPv4 address.

- ConnectPacket is updated to store an IPv6 address. It is backward compatible
with the old format, since the first 32 bits of the IP address field are used
for serialization of IPv4.

- Mainly updates the rest of the code to use the IPAddress structure instead
of a plain uint32_t.

- IPv6 address/port pairs should be represented as `[ip]:port` as per
convention. This applies to both cluster files and command line
arguments.
2019-03-04 14:12:41 -08:00
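The `[ip]:port` convention from commit 57832e625d can be illustrated with a short Python sketch built on the standard `ipaddress` module. The function name `parse_network_address` is hypothetical and this is not FoundationDB's parser; it only shows why the brackets are needed to separate an IPv6 address from its port.

```python
import ipaddress


def parse_network_address(text):
    """Split 'ipv4:port' or '[ipv6]:port' into (ip_address, port)."""
    bracketed = text.startswith("[")
    if bracketed:
        host, sep, port = text[1:].partition("]:")
    else:
        host, sep, port = text.rpartition(":")
    if not sep:
        raise ValueError(f"expected ip:port or [ip]:port, got {text!r}")

    ip = ipaddress.ip_address(host)  # raises ValueError on a bad address
    if ip.version == 6 and not bracketed:
        # Without brackets the last ':' is ambiguous: port or address part?
        raise ValueError(f"IPv6 addresses must be written as [ip]:port: {text!r}")
    return ip, int(port)


def packed_length(ip):
    """Bytes in the packed form: 4 for IPv4, 16 for IPv6."""
    return len(ip.packed)
```

For example, `parse_network_address("[::1]:4500")` yields the IPv6 loopback address with port 4500, while the unbracketed `"::1:4500"` is rejected.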
Evan Tschannen
3da85f3acd implemented the \xff/metadataVersion key, which can be used by layers to help them cheaply cache metadata and know when their cache is invalid 2019-02-28 17:45:00 -08:00
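The caching pattern that commit 3da85f3acd mentions for layers might look roughly like the sketch below. Everything except the key name is an assumption: `FakeDatabase` is an in-memory stand-in rather than the FoundationDB client API, and it models the version as a simple counter, which may differ from how the real key is encoded.

```python
METADATA_VERSION_KEY = b"\xff/metadataVersion"


class FakeDatabase:
    """In-memory stand-in for a database (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {METADATA_VERSION_KEY: 0}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key)

    def set_metadata(self, key, value):
        # Assumption: every metadata change also bumps the version key.
        self.data[key] = value
        self.data[METADATA_VERSION_KEY] += 1


class CachedMetadata:
    """Re-reads a metadata key only when the version key has changed."""

    def __init__(self, db, key):
        self.db = db
        self.key = key
        self.cached_version = None
        self.cached_value = None

    def get(self):
        current = self.db.get(METADATA_VERSION_KEY)
        if current != self.cached_version:
            # Cache is stale (or empty): refresh it from the database.
            self.cached_value = self.db.get(self.key)
            self.cached_version = current
        return self.cached_value
```

In this sketch a cache hit costs one read (the version key) instead of re-reading the metadata itself, which is where the saving would come from when the metadata is large or spread over many keys.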
Evan Tschannen
3a572b010f fix: a forced recovery needed to force the data distributor to restart 2019-02-19 16:04:52 -08:00
Evan Tschannen
065a45e05f Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen
4c35ebdcc6 fix: because of forced recoveries, storage servers in remote regions cannot update their durable version to (lastLogVersion - 5e6), because the lastLogVersion might have jumped due to an epoch end and the recovery version after the forced recovery could be before the epoch end, causing the storage server to want to rollback to a version it does not have on disk 2019-02-18 14:40:30 -08:00
Evan Tschannen
05ca0a10d8 fix: kill all storage servers which are not in the safe locality after a forced recovery 2019-02-18 14:30:51 -08:00
Evan Tschannen
abc3c01fb2 Update fdbclient/SystemData.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Jingyu Zhou
be5c962bb7 Add a new configuration version key \xff/conf/version
This fixes a bug found by the upgrade test, where the configuration monitor of the
data distributor was monitoring excludedServersVersionKey, which doesn't
change in the ChangeConfig workload. As a result, the data distributor was not aware of
configuration changes.

Add this new key and make sure it is updated on configuration changes
so that the monitor can detect them.
2019-02-14 16:37:16 -08:00
Meng Xu
b3f0326d81 let master wait for any applier reply at apply db
An applier may crash while applying mutations.
A node crash may make the master wait indefinitely for replies from all nodes.

Change the waitForAll semantics to waitForAny when waiting for the appliers' responses after applying mutations to the DB.

This is a workaround. The long-term solution should handle the failure in a better way.
2019-01-31 09:14:10 -08:00
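The difference between waiting for all replies and waiting for any reply, as described in commit b3f0326d81, can be sketched with Python's `asyncio`. This is an analogue and not Flow code: `applier` and `wait_for_any` are hypothetical names, and a crashed applier is modeled as a coroutine that never replies.

```python
import asyncio


async def applier(name, delay, crash=False):
    """Simulated applier; a crashed one never sends its reply."""
    await asyncio.sleep(delay)
    if crash:
        await asyncio.Event().wait()  # blocks forever
    return name


async def wait_for_any(coros):
    """Return as soon as at least one applier has replied."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED
    )
    # Stop waiting on the appliers that have not answered.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted(task.result() for task in done)


def demo_wait_for_any():
    return asyncio.run(
        wait_for_any([
            applier("applier-1", 0.01),
            applier("applier-2", 0.0, crash=True),
        ])
    )
```

Waiting for every task here would hang on the crashed applier, which is the failure the commit works around. The trade-off, as the message notes, is that this does not handle the failed node itself.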
Meng Xu
a56ba2faf6 update restore status 2019-01-30 17:30:29 -08:00
Meng Xu
2e11b38f3f Add print in fast restore agent about backup info 2019-01-30 11:18:11 -08:00
A.J. Beamon
2198d24ce1 Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies
# Conflicts:
#	fdbclient/MasterProxyInterface.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/ServerDBInfo.h
#	fdbserver/Status.actor.cpp
#	fdbserver/fdbserver.vcxproj
#	fdbserver/storageserver.actor.cpp
2019-01-24 11:43:26 -08:00
A.J. Beamon
8e05e95045 Added the ability to configure the latency band settings by setting a special key in \xff keyspace. 2019-01-18 16:18:34 -08:00
Meng Xu
d9268b54e8 fast restore: add data struct and assign role to nodes
add data structure to track the status of each node
add logic to let master node assign role to loader and applier
make sure the command request and reply are correct
2018-12-20 11:40:03 -08:00
Meng Xu
1b085a9817 sequential restore: pass 1 test case
-r simulation --logsize 1024MiB -f foundationdb/tests/fast/ParallelRestoreCorrectness.txt -b off -s 95208406
2018-12-03 10:57:30 -08:00
Evan Tschannen
4e54690005 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen
7892da032f fix: Do not remove the locality entry for the current transaction logs when removing storage servers
fix: dcId_locality map could be incorrect after restarting recruitEverything
2018-11-11 12:37:53 -08:00
Evan Tschannen
4b5d0b4e2c Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/AsyncFileBlobStore.actor.cpp
#	fdbclient/AsyncFileBlobStore.actor.h
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/BlobStore.h
#	fdbclient/HTTP.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/batcher.actor.h
#	fdbrpc/fdbrpc.vcxproj
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
19ae063b66 fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available 2018-11-08 15:44:03 -08:00
Robert Escriva
268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
0acfae1e76 fixed the windows linker error 2018-10-15 18:19:51 -07:00
Evan Tschannen
4c95a5ee0f added the basic structure for parallel restore 2018-10-09 18:47:28 -07:00
John Brownlee
2beeadf8be Adds a key range for storing changes to monitor conf files. 2018-10-01 10:49:02 -07:00
Evan Tschannen
ffde1a0e28 renamed onlySystem to mustContainSystemMutations, to accurately represent what setting the key does 2018-08-21 22:15:45 -07:00
Evan Tschannen
cb60002944 Added the ability to disable all commits which do not modify the system keys by setting \xff/onlySystem = 1 in the database 2018-08-21 21:09:50 -07:00
Evan Tschannen
284233baa1 added a key in the database with the locality of the current master 2018-06-14 19:36:02 -07:00
A.J. Beamon
e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
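The first two naming rules in commit e5488419cc can be expressed as regular expressions. The sketch below is only an illustration of those rules, not the check that was added to simulation; the function names are hypothetical, and restricting names to ASCII letters and digits is an extra assumption.

```python
import re

# Detail names: leading uppercase letter, no underscores.
DETAIL_NAME = re.compile(r"[A-Z][A-Za-z0-9]*")

# Type names: same rule, plus at most one underscore (Context_Type),
# where the character after the underscore is also uppercase.
TYPE_NAME = re.compile(r"[A-Z][A-Za-z0-9]*(?:_[A-Z][A-Za-z0-9]*)?")


def is_valid_detail_name(name):
    return DETAIL_NAME.fullmatch(name) is not None


def is_valid_type_name(name):
    return TYPE_NAME.fullmatch(name) is not None
```

Under these patterns `DurableVersion` is an acceptable detail name and `Context_Type` an acceptable type name, while `durable_version` and `Context_type` are rejected.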
Evan Tschannen
10d25927cd Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen
7af892f50b first working version of non-copying recovery working with fearless configurations 2018-04-08 21:24:05 -07:00
Evan Tschannen
579ba58930 pop old tags only looks at recovered tags, and checks if they are still being used 2018-03-30 19:08:01 -07:00
Yichi Chiang
26b93ff920 Share log mutations between backups and DRs which have the same backup range 2018-03-16 18:09:23 -07:00
Evan Tschannen
37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alec Grieser
0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Evan Tschannen
29c5d4ad3d upgrades from 5.X mostly supported, still some remaining correctness problems 2018-01-28 11:52:54 -08:00
Evan Tschannen
89f0f9318a decodeServerTagValue decodes tags encoded pre-6.0 2018-01-20 10:33:13 -08:00
Evan Tschannen
15962cf079 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbrpc/Locality.cpp
#	fdbrpc/Locality.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/ClusterRecruitmentInterface.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/masterserver.actor.cpp
#	fdbserver/worker.actor.cpp
#	flow/error_definitions.h
2017-10-05 17:09:44 -07:00