Changes:
LogSystem.h, LogSystemPeekCursor.actor.cpp:
Add APIs to find the ID of the tLog from which an SS has fetched the latest
set of versions.
storageserver.actor.cpp:
Capture the number of latest set of versions fetched, the time (in seconds)
in which those versions were fetched, and the tLog from which they were
fetched. Add this information to a TraceLogEvent.
Capture how many versions an SS has fetched in the
All TraceLogs that are related to ConnectionReset should be prefixed with
ConnectionReset. This should make it easy to query and aggregate by address and
role.
This introduces unhygenic macro variants that inline a `ENABLED &&`
before the TraceEvent. This way, they get entirely compiled out unless
enabled.
Then rewrite all debugMutation uses via sed.
Like:
* Leaving the proxy
* Entering the TLog
* Leaving the TLog
* Being read on a cursor
All of this brought to you by TagsAndMessage!
This also slides in a minor optimization as to how mutations are serialized per target log.
With TLS, a worker (or process) can have a TLS address and non-TLS address.
When a process is created in simulation, the primary address is TLS by default.
The non-TLS one is the TLS address port plus one.
In a connection between two workers, if their primary addresses do not enable
or disable TLS together, one worker will swap its primary address and secondary address
so that the TLS config of the two endpoints can match.
The swap can make the primary address no longer the TLS one that was created
when the process is created. And the swap only happens for worker instead of
process struct in simulation.
This swap can cause worker->address != process->address.
In checkForExtraDataStores actor, we use worker->address to check if a process
is killable and use the process->address to kill the process. The inconsistency
can cause simulation to kill a protected process that is not killable and leads
to simulation failure.
TLogServer and LogRouter had some leftover code from me trying to be
more "correct" about parallel peek semantics, but those changes weren't
reflected in the OldTLog* files. I've reverted the changes, as
realistically, they are more likely to waste CPU than improve TLog behavior.
This fixes an issue where one could hang for 10min for the second
parallel peek to time out, if one happened to catch the edge of a
onlySpilled transition wrong.
This is to guard against the case where
1. Peeks with sequence numbers 0-39 are submitted
2. A 15min pause happens, in which timeout removes the peek tracker data
3. Peeks with sequence numbers 40-59 are submitted, with the same peekId
The second round of peeks wouldn't have the data left that it's allowed
to start running peek 40 immediately, and thus would hang for 10min
until it gets cleaned up.
Also, guard against overflowing the sequence number.
This fixes an issue where one could hang for 10min for the second
parallel peek to time out, if one happened to catch the edge of a
onlySpilled transition wrong.
ServerPeekCursor::nextMessage() should only consume the message header, because
the reader() directly inherits the current position. The previous commit
changes the positon to the begining of the next message, which breaks storage
server code.