* Enable secure allocation mode in Arena
This mode allows zeroing out blocks holding sensitive data after use
* Introduce WipedString to all token-holding memory
Also introduce an option flag "sensitive"
* Make pointer equivalency a hard requirement for non-ASAN builds
So that we can detect when Arena/malloc/memory-wipe behavior changes
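
As a rough illustration of the wipe-after-use idea, here is a minimal Python sketch. It is not the actual Arena or WipedString code: the `SecureArena` class, its `allocate`/`release` methods, and the use of `bytearray` blocks are all assumptions made for this example, and only the "sensitive" flag name comes from the text above.

```python
class SecureArena:
    """Toy arena (hypothetical): hands out blocks and wipes sensitive ones on release."""

    def __init__(self):
        self._blocks = []  # (block, sensitive) pairs owned by this arena

    def allocate(self, size, sensitive=False):
        block = bytearray(size)
        self._blocks.append((block, sensitive))
        return block

    def release(self):
        for block, sensitive in self._blocks:
            if sensitive:
                # Overwrite in place so the old contents do not linger.
                block[:] = bytes(len(block))
        self._blocks.clear()


arena = SecureArena()
token = arena.allocate(8, sensitive=True)
token[:] = b"s3cr3t!!"
plain = arena.allocate(4)
plain[:] = b"data"
arena.release()
```

After `release()`, only the block flagged as sensitive has been zeroed; the ordinary block is left as it was.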
When the remote DC is down, the remote team collection of DD can still be
initializing, waiting for the remote to recover (all_tlog_recruited state).
However, getTeam requests can already be served by the remote team collection.
So a RelocateShard (data movement such as split or move) gets a team for the
remote DC, but the data movement can't make progress on the remote team
because the remote DC hasn't recovered yet. Because the data movement is
stuck, the primary cannot reach the "storage_recovered" state and stays in the
accepting_commit state.
The specific test failure: slow/ApiCorrectness.toml -s 339026305 -b on
at commit: 0edd899d65
In this test, the primary DC has 1 SS killed, and the remote DC has 2 TLogs and
2 SSes killed. So the remote is dead: the remaining 2 SSes can't make progress
because of the loss of 2 TLogs. repairDeadDatacenter() can't reach the
"storage_recovered" state because DD fails to move shards away from the killed
SS in the primary.
The fix is to exclude the entire remote in repairDeadDatacenter(), which tells
DD to mark all SSes in the remote as unhealthy. Another fix is to return empty
results for getTeam requests if the remote team collection is not ready. This
allows the data movement to continue; essentially, the remote team is left
unchanged for the data movement.
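
The second fix can be pictured with a small Python sketch. This is a simplified model, not DD's actual code: `TeamCollection`, `get_team`, and `pick_destinations` are hypothetical names, and readiness is reduced to a single boolean.

```python
from dataclasses import dataclass, field


@dataclass
class TeamCollection:
    """Hypothetical stand-in for a per-DC team collection."""
    name: str
    ready: bool
    teams: list = field(default_factory=list)

    def get_team(self):
        # Return an empty result while the collection is still initializing,
        # instead of handing out a team that cannot make progress.
        if not self.ready or not self.teams:
            return None
        return self.teams[0]


def pick_destinations(primary, remote):
    """Collect a destination team from every collection that can serve one."""
    dests = {}
    for tc in (primary, remote):
        team = tc.get_team()
        if team is not None:
            dests[tc.name] = team
    return dests


primary = TeamCollection("primary", ready=True, teams=[["ss1", "ss2"]])
remote = TeamCollection("remote", ready=False, teams=[["ss3", "ss4"]])
dests = pick_destinations(primary, remote)
```

In this model the relocation proceeds with a primary destination only, and the remote side is simply left out until its team collection becomes ready.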
Also, to minimize audit log loss, handle token usage audit logging at each usage.
This has a side-effect of making the token use log less bursty.
This also subtly changes the dedup cache policy.
Dedup time window used to be 5 seconds (default) since the start of batch-logging.
Now it's 5 seconds from the first usage since the closing of the previous dedup window.
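
The new window policy can be sketched as follows. This is an illustrative Python model only: the `DedupWindow` class and `should_log` method are hypothetical, and timestamps are passed in explicitly rather than read from a clock.

```python
class DedupWindow:
    """Hypothetical model: a dedup window that opens on first usage."""

    def __init__(self, window_seconds=5.0):
        self.window = window_seconds
        self.window_start = None  # no window is open until the first usage
        self.seen = set()

    def should_log(self, token_id, now):
        # Open a new window at the first usage after the previous one closed.
        if self.window_start is None or now - self.window_start >= self.window:
            self.window_start = now
            self.seen.clear()
        if token_id in self.seen:
            return False  # already logged within the current window
        self.seen.add(token_id)
        return True


dedup = DedupWindow(5.0)
results = [
    dedup.should_log("a", 100.0),  # first usage opens a window at t=100
    dedup.should_log("a", 102.0),  # duplicate within the window
    dedup.should_log("a", 105.0),  # window closed, a new one opens at t=105
    dedup.should_log("a", 109.9),  # duplicate within the second window
]
```

Note that the second window starts at the usage at t=105, not at a fixed batch boundary, which is the policy change described above.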
In the HA configuration, it's possible that 2 out of 3 machines in the remote
DC were killed, leaving too few machines for a successful recovery. So this PR
switches to Reboot to avoid such excessive kills.
- Added a background actor that listens on METRICS_EMISSION_UDP_PORT for incoming metrics (and verifies they are in the correct format)
- TraceEvent details have certain naming requirements. This commit gives Counter/LatencySample and its underlying IMetric separate names to avoid those issues
- Remove redundant operation from TokenSign
- Let the sign/verify API report errors directly instead of tracing at the failing subroutine, which lacks context
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.
The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to fully
recover.
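
The startup check can be sketched roughly as below. This is a toy Python model rather than the simulator's code: `SimProcess`, `on_process_start`, and the two boolean inputs are assumptions for illustration, and only the `RebootProcessAndSwitch` name comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class SimProcess:
    """Hypothetical stand-in for a simulated process."""
    name: str
    cluster_file_changed: bool = False
    pending_signals: list = field(default_factory=list)


def on_process_start(proc, cluster_files_should_be_reverted):
    """Re-send the switch signal to a process that missed it while rebooting."""
    if proc.cluster_file_changed and cluster_files_should_be_reverted:
        # Costs one extra reboot, but puts the process back on its original file.
        proc.pending_signals.append("RebootProcessAndSwitch")


missed = SimProcess("missed-the-switch", cluster_file_changed=True)
normal = SimProcess("already-reverted", cluster_file_changed=False)
for p in (missed, normal):
    on_process_start(p, cluster_files_should_be_reverted=True)
```

Only the process that still has a changed cluster file receives the signal; a process that was already reverted starts up normally.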
Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.
This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.