416 Commits

Author SHA1 Message Date
Josh Slocum
d37b2b0a76
Adding BlobFailureInjection workload (#9833)
* Adding BlobFailureInjection workload

* fixing formatting
2023-04-06 15:10:36 -05:00
Zhe Wu
d576d9a66a Remove debug TraceEvent 2023-03-27 11:47:11 -07:00
Zhe Wu
40dc54223c Add GC generation test, and make all simulation tests pass 2023-03-27 11:46:13 -07:00
Zhe Wu
b4e62b9b3e Update log cursor timeout check 2023-03-21 22:03:17 -07:00
Jingyu Zhou
5c97fb2c20 Use a constant for connectionFailuresDisableDuration 2023-03-09 09:50:24 -08:00
Jingyu Zhou
e18ed14278 Refactor to address comments 2023-03-09 09:39:27 -08:00
Jingyu Zhou
493e81f31d Limit connection failures to be within tests
In particular, disable connection failures when initializing the database
during the startup phase, i.e., before running with test specs.
2023-03-08 15:36:58 -08:00
Russell Sears
bcc05b1058 Improve support for prebuilt boost 2023-02-27 15:38:58 -06:00
Jingyu Zhou
9a257a60a4 Address review comments 2023-02-24 10:47:32 -08:00
Jingyu Zhou
0b2e02c402 Fix rare test failures
Unclog after DB is recovered, otherwise another recovery may become stuck again.
2023-02-23 15:42:33 -08:00
Jingyu Zhou
65443b6541 Fix compiling errors 2023-02-23 15:02:44 -08:00
Jingyu Zhou
ecae81882c Change to only clog once for a particular tlog
If we repeat clogging, different tlogs may be excluded, which can cause the
recovery to get stuck.
2023-02-23 14:31:39 -08:00
Jingyu Zhou
955826f2fe Add ClogTlog workload 2023-02-23 14:31:12 -08:00
Junhyun Shim
d9c126a2d9
Introduce WipedString for Arena block holding AuthZ tokens (#9381)
* Enable secure allocation mode in Arena

This mode allows zeroing out blocks holding sensitive data after use

* Introduce WipedString to all token-holding memory

Also introduce an option flag "sensitive"

* Make pointer equivalency a hard requirement for non-ASAN builds

So that we can detect when Arena/malloc/memory-wipe behavior changes
2023-02-16 10:44:32 +01:00
Jingyu Zhou
622520bd2d Return the source team if remote DC is dead
Also refactor the code with findTeamFromServers().
2023-02-10 11:11:07 -08:00
Jingyu Zhou
6c4a9b5f23 Fix DD stuck when remote DC is dead
When the remote DC is down, the remote team collection of DD can be initializing,
waiting for the remote to recover (all_tlog_recruited state). However, the
getTeam request can already be served by the remote team collection. So, for
a RelocateShard (data movement such as split, move), it will get a team for
the remote DC. But the data movement can't make progress on the remote team
because the remote DC hasn't recovered yet. Because the data movement is
stuck, the primary cannot reach the "storage_recovered" state and stays in
the accepting_commit state.

The specific test failure: slow/ApiCorrectness.toml -s 339026305 -b on
at commit: 0edd899d65

In this test, the primary DC has 1 SS killed, and the remote DC has 2 TLogs and 2 SSes killed.
So the remote is dead: the remaining 2 SSes can't make progress because of the
loss of 2 TLogs. repairDeadDatacenter() can't reach the "storage_recovered"
state because DD fails to move shards away from the killed SS in the
primary.

The fix is to exclude the whole remote DC in repairDeadDatacenter(), which tells DD to
mark all SSes in the remote as unhealthy. Another fix is to return empty
results for a getTeam request if the remote team collection is not ready. This
allows the data movement to continue; essentially, the remote team is left unchanged
for the data movement.
2023-02-10 11:11:07 -08:00
Junhyun Shim
be225acd2a Merge remote-tracking branch 'origin/main' into authz-tenant-name-to-tenant-id 2023-02-06 23:13:43 +01:00
Xiaoxi Wang
7190fa0c08 Merge branch 'main' of https://github.com/apple/foundationdb into fix/main/testTimeout 2023-02-03 13:48:54 -08:00
Xiaoxi Wang
b757e8914a fix BOOST_SYSTEM_NO_LIB redefinition in CI 2023-02-03 13:47:50 -08:00
Junhyun Shim
ce652fa284 Replace AuthZ's use of tenant names in token with tenant ID
Also, to minimize audit log loss, handle token usage audit logging at each usage.
This has a side-effect of making the token use log less bursty.
This also subtly changes the dedup cache policy.
Dedup time window used to be 5 seconds (default) since the start of batch-logging.
Now it's 5 seconds from the first usage since the closing of the previous dedup window.
2023-02-03 21:46:31 +01:00
Jingyu Zhou
e96adfa449 Fix excessive killing for HA configuration
In the HA configuration, it's possible that 2 out of 3 machines in the remote DC
were killed, leaving not enough machines for a successful recovery. So this PR
switches to Reboot to avoid such excessive killing.
2023-02-01 15:16:10 -08:00
Chaoguang Lin
4c5cbe6cda Merge branch 'main' of github.com:apple/foundationdb into fix-nightly-failure 2023-01-25 18:43:37 -08:00
Chaoguang Lin
fce9490c19 A Fix from Evan 2023-01-25 15:55:24 -08:00
Xiaoge Su
eb4e147ebf Reformat source 2023-01-24 15:06:27 -08:00
Xiaoge Su
0a60142160 Extract ProcessInfo, MachineInfo, KillType out from ISimulator 2023-01-24 14:48:42 -08:00
Xiaoge Su
50de69c897 Extract IConnection and NetworkAddress out from network.h 2023-01-24 14:48:31 -08:00
Xiaoge Su
3f03a6b12d Extract out IPAddress and IUDPSocket 2023-01-24 14:47:39 -08:00
sfc-gh-tclinkenbeard
986c792a9f Drop UDP packets more frequently in simulation 2023-01-15 17:32:57 -08:00
Kevin Hoxha
407c371635 metrics: Add simulation testing and fix incorrect TraceEvent names
- Added a background actor that listens on METRICS_EMISSION_UDP_PORT for incoming metrics (and verifies they are in the correct format)
- TraceEvent details have certain requirements for naming. This commit uses a separate name for Counter/LatencySample and its underlying IMetric to avoid those issues
2022-12-08 10:07:11 -08:00
Hui Liu
891331caed
Merge pull request #8881 from sfc-gh-huliu/fixinit
Init blobGranulesEnabled in ISimulator
2022-11-18 17:07:57 -08:00
Hui Liu
bee0377b4d Init blobGranulesEnabled in ISimulator 2022-11-18 15:53:06 -08:00
Junhyun Shim
bfefbfee8c
Merge pull request #8705 from sfc-gh-jshim/authz-accept-base64-for-jwt-tenant-name
Make token's 'tenants' field base64-encoded (cf. base64url)
2022-11-16 10:17:10 +01:00
Markus Pilman
503769ef05
Merge pull request #8496 from sfc-gh-mpilman/bugfixes/machines-attrition-debugging
Enable machine attrition injection
2022-11-15 16:32:33 -07:00
Junhyun Shim
41ea1678d0 Merge remote-tracking branch 'origin/main' into authz-accept-base64-for-jwt-tenant-name 2022-11-15 22:57:49 +01:00
sfc-gh-tclinkenbeard
c03f60c618 Update rare code probe annotations 2022-11-15 13:21:25 -08:00
Markus Pilman
f105cb1809 Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-11-14 10:11:52 -07:00
Markus Pilman
40c1bbc49a Fix gcc problem with typenames 2022-11-09 10:14:13 -07:00
Markus Pilman
6643ed0a26 fix print-sim-time 2022-11-08 12:19:39 -07:00
Junhyun Shim
112363ef14 Merge remote-tracking branch 'origin/main' into authz-accept-base64-for-jwt-tenant-name 2022-11-08 13:16:08 +01:00
Junhyun Shim
50f4021cf7 Make token's 'tenants' field base64-encoded (cf. base64url)
- Remove redundant operation from TokenSign
- Let the sign/verify API directly report errors
  instead of tracing at failing subroutine, which lacks context
2022-11-04 20:17:08 +01:00
Josh Slocum
cff99a64f6
Blob Granule Attrition fixes (#8682)
* Assert was incorrect in change feed destroy race with moved() clearing map

* fixing race between injected fault and granule revoke

* Handling race in sim2 blob worker attrition check
2022-11-03 18:48:10 -05:00
Markus Pilman
f1fea14255 Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-11-01 13:51:35 -06:00
Lukas Joswiak
5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to fully
recover.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
Lukas Joswiak
f43011e4b7 Notify processes joining the wrong cluster
And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.
2022-10-27 13:56:13 -07:00
Lukas Joswiak
a72066be33 Add simulation support for changing the cluster file 2022-10-27 13:56:13 -07:00
Markus Pilman
3c943ac37a fix merge bugs 2022-10-26 10:42:11 -06:00
Markus Pilman
e7b5b870a3 Merge remote-tracking branch 'origin/main' into bugfixes/machines-attrition-debugging 2022-10-24 15:24:36 -06:00
Markus Pilman
2310584a05 Merge remote-tracking branch 'sfc/bugfixes/machines-attrition-debugging' into bugfixes/machines-attrition-debugging 2022-10-24 15:01:03 -06:00
Markus Pilman
43cafb0bc2 Track disk corruptions and mark resulting failures as injected 2022-10-24 14:54:43 -06:00
Andrew Noyes
fb9333e863 Delete Sim2::PromiseTask
Previously this was leaking and causing simulation OOMs. Also make it
FastAllocated to match Net2::PromiseTask
2022-10-24 09:25:21 -07:00