256 Commits

Author SHA1 Message Date
Dan Lambright
5bdc525353
Merge branch 'main' into vv 2022-04-08 13:16:04 -04:00
Renxuan Wang
465ff712b6
Move Hostname to its own files. (#6759)
* Change DNS cache to use std::map.

Revert commit 90c259d84e95dd35e01149c0a86bd18e82e33930, because if we use unordered_map, toString() can be inconsistent.

* Move ClientKnob::COORDINATOR_HOSTNAME_RESOLVE_DELAY to FlowKnob::HOSTNAME_RESOLVE_DELAY.

* Move Hostname to its own files.

Also, add resolve-related variables and functions in Hostname.
2022-04-04 19:04:51 -07:00
Renxuan Wang
5a336655f1 Use unordered_map in DNS cache. 2022-04-04 15:08:17 -07:00
Renxuan Wang
7da31857b7 Address comments. 2022-04-04 15:08:17 -07:00
Renxuan Wang
e548c0d604 Add DNS cache. 2022-04-04 15:08:17 -07:00
Renxuan Wang
ff934ca2ad Change MockDNS to DNSCache. 2022-04-04 15:08:17 -07:00
Jingyu Zhou
64d4658034 Merge branch 'main' into vv
Fix Conflicts:
	flow/error_definitions.h
2022-04-01 21:49:24 -07:00
Steve Atherton
6eb1c2ae48
Merge pull request #6574 from sfc-gh-satherton/redwood-rare-bugs
Rare correctness bug fixes in Redwood
2022-04-01 16:40:22 -07:00
Chaoguang Lin
7d365bd1bb
Remote ikvs debugging (#6465)
* initial structure for remote IKVS server

* moved struct to .h file, added new files to CMakeList

* happy path implementation, connection error when testing

* saved minor local change

* changed tracing to debug

* fixed onClosed and getError being called before init is finished

* fix spawn process bug, now use absolute path

* added server knob to set ikvs process port number

* added server knob for remote/local kv store

* implement simulator remote process spawning

* fixed bug for simulator timeout

* commit all changes

* removed print lines in trace

* added FlowProcess implementation by Markus

* initial debug of FlowProcess, stuck at parent sending OpenKVStoreRequest to child

* temporary fix for process factory throwing segfault on create

* specify public address in command

* change remote kv store knob to false for jenkins build

* made port 0 open random unused port

* change remote store knob to true for benchmark

* set listening port to randomly opened port

* added print lines for jenkins run open kv store timeout debug

* removed most tracing and print lines

* removed tutorial changes

* update handleIOErrors error handling to handle remote-ikvs cases

* Push all debugging changes

* A version where worker bug exists

* A version where restarting tests fail

* Use both the name and the port to determine the child process

* Remove unnecessary update on local address

* Disable remote-kvs for DiskFailureCycle test

* A version where restarting stuck

* A version where most restarting tests green

* Reset connection with child process explicitly

* Remove change on unnecessary files

* Unify flags from _ to -

* fix merging unexpected changes

* fix trac.error to .errorUnsuppressed

* Add license header

* Remove unnecessary header in FlowProcess.actor.cpp

* Fix Windows build

* Fix Windows build, add missing ;

* Fix a stupid bug caused by code dropped by code merging

* Disable remote kvs by default

* Pass the conn_file path to the flow process, though not needed, but the buildNetwork is difficult to tune

* serialization change on readrange

* Update traces

* Refactor the RemoteIKVS interface

* Format files

* Update sim2 interface to not clog connections between parent and child processes in simulation

* Update comments; remove debugging symbols; Add error handling for remote_kvs_cancelled

* Add comments, format files

* Change method name from isBuggifyDisabled to isStableConnection; Decrease(0.1x) latency for stable connections

* Commit the IConnection interface change, forgot in previous commit

* Fix the issue that onClosed request is cancelled by ActorCollection

* Enable the remote kv store knob

* Remove FlowProcess.actor.cpp and move functions to RemoteIKeyValueStore.actor.cpp; Add remote kv store delay to avoid race; Bind the child process to die with parent process

* Fix the bug where one process starts storage server more than once

* Add a please_reboot_remote_kv_store error to restart the storage server worker if remote kvs died abnormally

* Remove unreachable code path and add comments

* Clang format the code

* Fix a simple wait error

* Clang format after merging the main branch

* Testing mixed mode in simulation if remote_kvs knob is enabled, setting the default to false

* Disable remote kvs for PhysicalShardMove which is for RocksDB

* Cleanup #include orders, remove debugging traces

* Revert the reorder in fdbserver.actor.cpp, which fails the gcc build

Co-authored-by: “Lincoln <“lincoln.xiao@snowflake.com”>
2022-03-31 17:08:59 -07:00
Jingyu Zhou
cfcf0f152c Merge branch 'main-4a085fc84' into vv
Fix Conflicts:
	fdbclient/NativeAPI.actor.cpp
	fdbserver/ClusterRecovery.actor.cpp
	fdbserver/MasterInterface.h
	fdbserver/masterserver.actor.cpp
	flow/error_definitions.h
2022-03-30 22:28:06 -07:00
Jingyu Zhou
e9659b5dd4 Merge branch 'master-PR-6500' into vv
Fix Conflicts:
	fdbclient/CommitProxyInterface.h
	fdbclient/NativeAPI.actor.cpp
	fdbserver/masterserver.actor.cpp
2022-03-30 14:53:49 -07:00
Steve Atherton
2a52c76b7a Added INetwork::timer_int() for convenience. Clarified what timer_int() actually returns in header comments. 2022-03-30 14:47:24 -07:00
sfc-gh-tclinkenbeard
a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
Renxuan Wang
06b1d06d38 Support hostname in coordinators commands. 2022-02-24 23:02:29 -08:00
Sreenath Bodagala
72e06369a4 Merge remote-tracking branch 'apple-upstream/main' into version-vector-prototype 2022-02-08 17:47:57 +00:00
Renxuan Wang
2ea4146e1f Add resolveTCPEndpointBlocking() to resolve hostnames where async resolving is impossible. 2022-01-28 12:20:41 -08:00
Renxuan Wang
1d62bc9437 Hostname needs a default constructor. 2022-01-20 19:18:15 -08:00
Renxuan Wang
a6c482ee91 Add the support of using hostname in ConnectionString.
This PR comes without simulation tests except some unit tests. The simulation tests will be in the PR that uses hostname in code logic.
2022-01-20 19:18:15 -08:00
Dan Lambright
9544379cdf rebase 2022-01-20 11:12:33 -05:00
Renxuan Wang
28832a99d6 Address comment. 2022-01-18 14:34:18 -08:00
Renxuan Wang
b8bab06e16 Add the functions to set and get mock DNS.
These functions will be used in restarting tests, where mock DNS needs to be saved to and read from files.
2022-01-18 14:34:18 -08:00
Ata E Husain Bohra
936bf5336a
Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine" (#6191)
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""

Major changes includes:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9c9888e203421290959bd7f2c075d7f.
1.b. This reverts commit d174bb2e06bff01157d16c652073536c54d17f7f.
1.c. This reverts commit 30b05b469c87d9b526b427751c211fb5cf7ff9cd.

2. Update Status.actor to track ClusterController interface to track
   recovery status.
3. Introduce a ServerKnob to define "cluster recovery trace event"
   prefix; for now keeping it as "Master", however, it should allow
   smooth transition to "Cluster" prefix as it seems more appropriate.
2022-01-06 12:15:51 -08:00
Aaron Molitor
30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff5dd66bdbbc5b984688ac3ebb15b901.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra
dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Renxuan Wang
46d17d748f Change member variable fromHostname to type bool.
So that it can be serialized.
2021-11-23 14:25:02 -10:00
Lukas Joswiak
18243351e7 Fix possible data race
Transactions (created on a separate thread) can read the `globals` field
at the same time as `setGlobal` is called on the main thread, causing a
potential race. TSAN surfaced this issue.
2021-11-18 10:16:20 -08:00
negoyal
8b0938c7b3 Added a TODO 2021-11-17 09:45:50 -08:00
Steve Atherton
035e0d6e52
Merge branch 'master' into bit-flipping-workload 2021-11-16 14:42:22 -08:00
Renxuan Wang
4630b0ccea Move DNS mock from SimExternalConnection to Sim2.
This is a revise PR of #5934. In simulation, we don't have direct access to SimExternalConnection.
2021-11-15 17:02:51 -08:00
Renxuan Wang
f15ceb5489 Add Hostname struct, and fromHostname in NetworkAddress struct. 2021-11-04 11:42:28 -07:00
sfc-gh-tclinkenbeard
d0c9cf4fb0 Enable mismatched-tags clang warning 2021-11-01 14:18:31 -07:00
sfc-gh-tclinkenbeard
ebcc023b6f Enable missing-field-initializers clang warning 2021-11-01 14:18:31 -07:00
Dan Lambright
befe1993c4 fix conflict on rebase 2021-10-29 12:25:26 -04:00
negoyal
1e7338b6c3 Merge branch 'master' into bit-flipping-workload 2021-10-28 14:24:49 -07:00
Evan Tschannen
2208b04174
Merge pull request #5855 from sfc-gh-etschannen/blob_full_clean
Blob Granules V0
2021-10-26 09:57:35 -07:00
sfc-gh-tclinkenbeard
49a667c29b Improve const-correctness of INetwork 2021-10-25 14:42:31 -07:00
Josh Slocum
6a24ef9258 adding priorities to blob worker and fixing monitoring in blob manager 2021-10-05 16:51:19 -05:00
Suraj Gupta
95166796cd Address PR comments. 2021-10-04 20:16:22 -04:00
Suraj Gupta
4d54669ccd Recruit the blob workers via blob manager.
In this PR, the blob manager now recruits blob workers
(via communication with the cluster controller). Blob workers
are onboarded as blob worker processes enter the cluster.
2021-10-04 11:07:08 -04:00
negoyal
a7721d9786 Remove debug trace events and clang-format. 2021-09-10 15:41:22 -07:00
negoyal
7729a282ce Misc fixes and updated test toml file. 2021-09-08 14:31:09 -07:00
negoyal
a8baeb75d0 Misc fixes. 2021-09-03 15:03:12 -07:00
negoyal
3b34423248 Merge branch 'master' into bit-flipping-workload 2021-08-31 12:14:51 -07:00
Dan Lambright
8689e1f106 merge with master 2021-08-30 15:29:08 -04:00
Sreenath Bodagala
a081c0baa5 Merge remote-tracking branch 'apple-upstream/master' into version-vector-prototype 2021-08-05 22:40:32 +00:00
Lukas Joswiak
5dc9a97230 Merge branch 'master' into fixes/alp6 2021-08-01 20:42:52 -07:00
negoyal
9e7197faba Bunch of changes based on review comments and discussions. 2021-07-30 01:32:43 -07:00
negoyal
40b4f3b2f1 Merge branch 'master' into bit-flipping-workload 2021-07-28 18:06:07 -07:00
negoyal
050c218502 New Disk Delay Logic and ChaosMetrics. 2021-07-28 16:03:37 -07:00
Evan Tschannen
256a18e43b Flow transport uses an ordered delay to avoid out of order reply promise stream messages 2021-07-27 12:01:32 -07:00