At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
responsible to recruit all other processes as well restore the
cluster state.
Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.
Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
process like other worker processes compared to current scheme
where "sequencer" process gets special treatment. In newer scheme
sequencer is responsible for maintaining/providing
"committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
the sequencer though orchestrating the recovery state machine, it
need to reachout to the ClusterController for recruiting worker
processes etc.
NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.
Next Steps:
Cluster recovery documentation will be updated in near future.
* Unify flags implementation and change help text in backup.actor.cpp
Description
Testing
* Keep LOG_GROUP unchanged
Description
Testing
* Transfer the hyphens to underscores for internal options and user's input, EXCEPT leading hyphens
Description
Testing
* Use a deep copy of the user's input flag to do the match
Description
Testing
* Convert the _ to - in Option arrays of backup.actor.cpp
Description
Testing
* Transter _ to - for files:
TLSConfig.actor.h, fdbcli.actor.cpp, fdbserver.actor.cpp, FileConverter.h, FileConverter.cpp
Description
Testing
* Change another way to unify flag: using SO_O_ICASE_HYPHEN_AND_UNDERSCORE to determine whether we do the conversion in function IsEqual
Description
Testing
* Change the config command's name from SO_O_ICASE_HYPHEN_AND_UNDERSCORE to SO_O_HYPHEN_TO_UNDERSCORE
Description
Testing
* Update the comment for the SO_O_HYPHEN_TO_UNDERSCORE
Description
Testing
* Fix left underscore in SOption arrays
Description
Testing
* Convert _ to - in several files for commands
Description
Testing
* Make the FDBService and fdbmonitor backward compatible
Description
Testing
* Fix bugs about pointers
Description
Testing
* Check underscore and hyphen at the same time for --knob_, --localily_ and --test_
And fix bugs in fdbmonitor and FDBService
Description
Testing
* Simplify the function in fdbmonitor and FDBService about retrieving arguments.
And fix some documents in masterserver.actor.cpp
Description
Testing
* Convert _ to - for knob in the setKnob functions
Description
Testing
* Convert - to _ in the setKnob functions
Description
Since key in the knob related maps only contain _
Testing
* Rename varialbe name in the fdbmonitor and FDBService for clarification
Description
Testing
Co-authored-by: Chang Liu <chang.liu@snowflake.com>
1. Add a trace event when a database is created and move the cluster file / connection string from ClientStart to the new trace event
2. Add a detail for the path to the image being loaded
3. Add a detail for whether a client library is primary or not
4. Set a thread name for each external client thread that includes the release version
Patch improves on handling scenarios where either commit or grv proxies
value is update to -1 OR `proxies_count` is being reset.
The code splits the proxies between two proxies by ensuring for invalid
input configuration, the min (read as 1) proxies gets provisioned, otherwise,
the split is done based on input values
Patch handles the scenario where mutation supplied values to update grv_proxies
and/or commit_proxies is -1, however, the total proxy count > 1,
uses DEFAULT_COMMIT_GRV_PROXIES_RATIO to split proxies between
grv_proxies & commit_proxies.
This eliminates many useless warnings when compiling.
`#pragma message: The practice of declaring the Bind placeholders (_1, _2, ...) in the global namespace is deprecated. Please use <boost/bind/bind.hpp> + using namespace boost::placeholders, or define BOOST_BIND_GLOBAL_PLACEHOLDERS to retain the current behavior.`
Transactions (created on a separate thread) can read the `globals` field
at the same time as `setGlobal` is called on the main thread, causing a
potential race. TSAN surfaced this issue.
* Redwood files now growth in large page chunks controlled by a knob to reduce truncate() calls for expansion. PriorityMultiLock has limit on consecutive same-priority lock release. Increased Redwood max priority level to 3 for more separation at higher BTree levels.
* Simulation fix, don't mark certain IO timeout errors as injected unless the simulated process has been set to have an unreliable disk.
* Pager writes now truncate gradually upward, one chunk at a time, in response to writes, which wait on only the necessary truncate operations. Increased buggified chunk size because truncate can be very slow in simulation.
* In simulation, ioTimeoutError() and ioDegradedOrTimeoutError() will wait until at least the target timeout interval past the point when simulation is sped up.
* PriorityMultiLock::toString() prints more info and is now public.
* Added queued time to PriorityMultiLock.
* Bug fix to handle when speedUpSimulation changes later than the configured time.
* Refactored mutation application in leaf nodes to do fewer comparisons and do in place value updates if the new value is the same size as the old value.
* Renamed updatingInPlace to updatingDeltaTree for clarity. Inlined switchToLinearMerge() since it is only used in one place.
* Updated extendToCover to be more clear by passing in the old extension future as a parameter. Fixed initialization warning.