23470 Commits

Author SHA1 Message Date
Steve Atherton
f4e8854e8c
Merge pull request #8517 from sfc-gh-etschannen/feature-disk-queue-perf
added a disk queue load generator
2022-10-28 14:15:55 -07:00
Jingyu Zhou
d672b1cbce
Merge pull request #8613 from sfc-gh-etschannen/fix-specific-unit-test
Specific unit test should only run one tests instead of all tests
2022-10-28 11:02:12 -07:00
Evan Tschannen
dd970a5c99 Specific unit test should only run one tests instead of all tests 2022-10-28 10:55:20 -07:00
Evan Tschannen
51e2f8e74b made the test clean up after itself 2022-10-28 09:26:48 -07:00
Andrew Noyes
0a15f081a1
Proactively clean up idempotency ids for successful commits (#8578)
* Proactively clean up idempotency ids for successful commits

This change also includes some minor changes from my branch working on
an idempotency ids cleaner that I'd like to get merged sooner rather
than later.

- Adding a timestamp to idempotency values
- Making IdempotencyId an actor file
- Adding commit_unknown_result_fatal
- Checking idempotencyIdsExpiredVersion in determineCommitStatus
- Some testing QOL changes

* Factor out decodeIdempotencyKey logic

* Fix formatting

* Update flow/include/flow/error_definitions.h

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Use KeyBackedObjectProperty for idempotencyIdsExpiredVersion

* Add IDEMPOTENCY_ID_IN_MEMORY_LIFETIME knob

* Rename ExpireIdempotencyKeyValuePairRequest

Also add a code probe for the case where an ExpireIdempotencyIdRequest is
received before the count is known, and add an assert

* Fix formatting and add TODO for nwijetunga

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-10-28 09:07:54 -07:00
Jingyu Zhou
4d789b2fd9
Merge pull request #8602 from apple/revert-8498-mmmm
Revert "Cancel watch when the key is not being waited"
2022-10-27 21:14:39 -07:00
Jingyu Zhou
49645a7755 Revert "Clean up unused comment in flow.h"
This reverts commit 03b102d86aecbe700aa8402ae31d0431bfb0b2b9.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
dc60f63f9b Revert "Cancel watch when the key is not being waited"
This reverts commit 639afbe62cc157a3428261bf8783088becc9ac13.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
fbe9802be5 Revert "configurationMonitor does not need to check watch reference count"
This reverts commit ab0f827058c21dfab66462c3ce8545c6eec6a6e5.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
634bd529e7 Revert "Record the version of each watch"
This reverts commit 4bd24e4d6460c5cf38117b89246561bb0d83e3ef.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
19ae4e7eb7 Revert "Reformat source"
This reverts commit ec47c261bf743e4ffefbea2e70641afdf8f16491.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
e460933b52 Revert "Remove debugging output"
This reverts commit 41d1d6404d933f0574d88d1fa2a68c642413bf4b.
2022-10-27 19:46:05 -07:00
Jingyu Zhou
e7fd3eda00 Revert "Update fdbclient/NativeAPI.actor.cpp"
This reverts commit 812243bafab4b8cb9cad49c7c22f16063f39b37e.
2022-10-27 19:46:05 -07:00
Lukas Joswiak
9625efd5b9 Add comment about configuration database 2022-10-27 13:56:13 -07:00
Lukas Joswiak
8e76621653 Disable shared state updates on configuration database 2022-10-27 13:56:13 -07:00
Lukas Joswiak
91146a03f0 Write cluster ID to ClientDBInfo
This enables clients to receive the cluster ID.
2022-10-27 13:56:13 -07:00
Lukas Joswiak
28540e5962 Format 2022-10-27 13:56:13 -07:00
Lukas Joswiak
a8f8757f77 Rename cluster ID key
In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has
been moved to the database. This was causing protocol compatibility
issues during upgrades, so we need to rename the key.
2022-10-27 13:56:13 -07:00
Lukas Joswiak
02bc5edbf8 Avoid blocking in choose when 2022-10-27 13:56:13 -07:00
Lukas Joswiak
9d3c3b1efe Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-27 13:56:13 -07:00
Lukas Joswiak
1fca3b7ddc Modify how cluster ID tests are run in simulation 2022-10-27 13:56:13 -07:00
Lukas Joswiak
bba05b7c9b Move cluster ID from txnStateStore to the database
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
2022-10-27 13:56:13 -07:00
Lukas Joswiak
5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to recover
fully.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
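The startup check described in commit 5ca2b89bdf can be sketched roughly as follows. This is a minimal illustration only, not the simulator's code: `SimProcess`, `StartupAction`, `onProcessStart`, and `applyStartupAction` are hypothetical names, and the per-process DR-cluster field mentioned in the message is left out.

```cpp
#include <cassert>

// Hypothetical per-process state tracked by a simulator (illustrative names only).
struct SimProcess {
    bool clusterFileSwitched = false; // true while the process runs with a swapped cluster file
};

enum class StartupAction { None, RebootProcessAndSwitch };

// Decide what to do when a process starts up in simulation.
// `clusterFilesReverted` is true once the cluster is in a state where every
// process should already be back on its original cluster file.
StartupAction onProcessStart(const SimProcess& p, bool clusterFilesReverted) {
    if (p.clusterFileSwitched && clusterFilesReverted) {
        // The process missed the earlier signal while it was rebooting: resend it now.
        return StartupAction::RebootProcessAndSwitch;
    }
    return StartupAction::None;
}

// Apply the signal: the extra reboot puts the process back on its original file.
void applyStartupAction(SimProcess& p, StartupAction action) {
    if (action == StartupAction::RebootProcessAndSwitch) {
        assert(p.clusterFileSwitched);
        p.clusterFileSwitched = false;
    }
}
```

The check runs at process start rather than at signal time because, per the message, a process that is mid-reboot is not in the simulator's list of active processes when the signal is sent.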
Lukas Joswiak
f43011e4b7 Notify processes joining the wrong cluster
And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.
2022-10-27 13:56:13 -07:00
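One way to picture the "zombie" behaviour from commit f43011e4b7 is a small state check on join. The sketch below is an assumption-laden illustration: `Worker`, `WorkerState`, and `tryJoin` are hypothetical names, and the rule that a worker without a persisted ID adopts the cluster's ID on first join is assumed rather than taken from the commit.

```cpp
#include <cassert>
#include <string>

enum class WorkerState { Active, Zombie };

// Hypothetical worker record; an empty ID means the worker has not joined any cluster yet.
struct Worker {
    std::string persistedClusterId;
    WorkerState state = WorkerState::Active;
};

// Hypothetical join check against the cluster's ID.
// Returns true if the worker may be recruited.
bool tryJoin(Worker& w, const std::string& clusterId) {
    assert(!clusterId.empty());
    if (w.state == WorkerState::Zombie) {
        return false; // refuses all work until an operator handles it manually
    }
    if (w.persistedClusterId.empty()) {
        w.persistedClusterId = clusterId; // assumption: the first join persists the ID
        return true;
    }
    if (w.persistedClusterId != clusterId) {
        w.state = WorkerState::Zombie; // joined the wrong cluster: stop doing work
        return false;
    }
    return true;
}
```

Note that in this sketch a zombie worker stays rejected even if it later contacts the correct cluster, mirroring the "wait forever" wording in the message.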
Lukas Joswiak
72a97afcd6 Avoid recruiting workers with different cluster ID 2022-10-27 13:56:13 -07:00
Lukas Joswiak
a72066be33 Add simulation support for changing the cluster file 2022-10-27 13:56:13 -07:00
Jingyu Zhou
6e0835f8a8
Merge pull request #8599 from technmsg/main
updated copyright year on web site
2022-10-27 13:36:56 -07:00
Xiaoge Su
812243bafa Update fdbclient/NativeAPI.actor.cpp
Co-authored-by: Jingyu Zhou <jingyuzhou@gmail.com>
2022-10-27 12:42:05 -07:00
Xiaoge Su
41d1d6404d Remove debugging output 2022-10-27 12:42:05 -07:00
Xiaoge Su
ec47c261bf Reformat source 2022-10-27 12:42:05 -07:00
Xiaoge Su
4bd24e4d64 Record the version of each watch
In the case
    1. A watch on key A is set; the watchValueMap ACTOR, denoted X, starts waiting.
    2. All watches are cleared due to a connection string change.
    3. The watch on key A is restarted with watchValueMap ACTOR Y.
    4. X receives the cancel exception and tries to dereference the counter. This causes Y to be cancelled.

the reference count will cause the watch to terminate prematurely.
Recording the version of each watch helps prevent this issue.
2022-10-27 12:42:05 -07:00
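The race in commit 4bd24e4d64 and the idea of tagging each watch with a version can be sketched as below. This is a simplified model under stated assumptions, not the NativeAPI code: `WatchTable`, `WatchEntry`, and the use of a monotonically increasing generation number as the "version" are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-key bookkeeping; not FoundationDB's actual types.
struct WatchEntry {
    std::uint64_t generation; // which incarnation of the watch owns this entry
    int waiters;              // number of actors currently waiting on it
};

class WatchTable {
public:
    // Register a waiter and return the generation of the watch it belongs to.
    std::uint64_t addWaiter(const std::string& key) {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            it = entries_.emplace(key, WatchEntry{ nextGeneration_++, 0 }).first;
        }
        ++it->second.waiters;
        return it->second.generation;
    }

    // Drop every watch, e.g. after a connection string change.
    void clearAll() { entries_.clear(); }

    // Release a waiter. A waiter from an older generation is ignored, so a
    // late cancellation of actor X cannot tear down the watch owned by actor Y.
    bool releaseWaiter(const std::string& key, std::uint64_t generation) {
        auto it = entries_.find(key);
        if (it == entries_.end() || it->second.generation != generation) {
            return false; // stale actor: its watch was already cleared or replaced
        }
        assert(it->second.waiters > 0);
        if (--it->second.waiters == 0) {
            entries_.erase(it);
        }
        return true;
    }

    bool isActive(const std::string& key) const { return entries_.count(key) != 0; }

private:
    std::map<std::string, WatchEntry> entries_;
    std::uint64_t nextGeneration_ = 1;
};
```

Replaying the four steps from the message: X calls `addWaiter("A")`, `clearAll()` runs, Y calls `addWaiter("A")` and gets a newer generation, and X's late `releaseWaiter` returns false without touching Y's entry.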
Xiaoge Su
ab0f827058 configurationMonitor does not need to check watch reference count 2022-10-27 12:42:05 -07:00
Xiaoge Su
639afbe62c Cancel watch when the key is not being waited
Currently, there is a cyclic reference situation in

    DatabaseContext -> WatchMetadata -> watchStorageServerResp ->
    DatabaseContext

If a watch is created in the DatabaseContext, then even if the
corresponding wait ACTOR is cancelled, the WatchMetadata will still hold
a reference to the watchStorageServerResp ACTOR, which holds a reference
to the DatabaseContext.

In this situation, any DatabaseContext that holds a watch will not be
automatically destructed, since its reference count will never drop to
0 until the watched value changes. Every time the cluster recovers,
several watches are created, and when the cluster restarts, the
DatabaseContext that is no longer in use cannot be destructed because of
these watches.

With this patch, each wait on the watch is counted. When the watch is
either triggered or cancelled, the corresponding count is reduced. If no
one is waiting on a watch, the watch is cancelled, effectively reducing
the reference count of the DatabaseContext. This will hopefully fix the
issue mentioned above.

The code was tested by 1) manually changing the number of logs of a
local cluster and observing the cluster recovery and the previous
DatabaseContext being destructed; 2) a 100K joshua run with 1 failure;
the same test also fails on the current git main branch.
2022-10-27 12:42:05 -07:00
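The cyclic reference from commit 639afbe62c and the effect of cancelling an unwaited watch can be modelled with `std::shared_ptr`. The sketch is a rough analogy under assumptions: `Context` stands in for DatabaseContext, `Watch` collapses WatchMetadata and the watchStorageServerResp ACTOR into one object, and the helper names are hypothetical. Flow's own reference counting is not used here.

```cpp
#include <cassert>
#include <memory>

struct Context; // stand-in for DatabaseContext

// Stand-in for WatchMetadata plus the response actor it holds.
struct Watch {
    std::shared_ptr<Context> owner; // models the actor's reference back to the context
    int waiters = 0;                // number of callers currently waiting on the watch

    void cancel() { owner.reset(); } // dropping the actor releases the context
};

struct Context {
    std::shared_ptr<Watch> watch;
    bool* destroyed; // set to true when the context is destructed
    explicit Context(bool* d) : destroyed(d) {}
    ~Context() { *destroyed = true; }
};

void addWaiter(Watch& w) { ++w.waiters; }

// When the last waiter goes away, cancel the watch so the cycle is broken.
void removeWaiter(Watch& w) {
    assert(w.waiters > 0);
    if (--w.waiters == 0) {
        w.cancel();
    }
}

// Returns true if the context is destructed only after its last waiter is removed.
bool contextFreedAfterLastWaiter() {
    bool destroyed = false;
    auto ctx = std::make_shared<Context>(&destroyed);
    auto watch = std::make_shared<Watch>();
    ctx->watch = watch;
    watch->owner = ctx; // closes the cycle Context -> Watch -> Context
    addWaiter(*watch);

    ctx.reset(); // external reference dropped; the cycle still keeps the context alive
    if (destroyed) {
        return false;
    }
    removeWaiter(*watch); // last waiter gone: watch cancelled, cycle broken
    return destroyed;
}
```

Without the waiter count, nothing would ever call `cancel()`, and the context would stay alive until the watched value changed, which is the leak the message describes.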
Xiaoge Su
03b102d86a Clean up unused comment in flow.h 2022-10-27 12:42:05 -07:00
Alex Moundalexis
67049518b9
updated copyright year on web site 2022-10-27 15:05:52 -04:00
Nim Wijetunga
bf01d9b879
Bulk Setup Workload Improvements (#8573)
* bulk setup workload improvements

* fix workload

* modify
2022-10-27 11:10:14 -07:00
Jingyu Zhou
fe66c026b4
Merge pull request #8598 from jzhou77/fix
Fix restarting restore test failure
2022-10-27 10:44:17 -07:00
Josh Slocum
4d3553481f
Blob connection provider test (#8478)
* Refactoring test blob metadata creation

* Implementing BlobConnectionProviderTest

* createRandomTestBlobMetadata supports blobstore and works outside simulation
2022-10-27 10:44:06 -05:00
Jingyu Zhou
6c0f890f78 Fix restarting restore test failure
An old fdbserver may not set the "enableSnapshotBackupEncryption" key, so we
should allow the key to be absent.
2022-10-27 08:43:55 -07:00
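Tolerating a key that an older writer never set, as in commit 6c0f890f78, usually means distinguishing "absent" from "present with a value". A minimal sketch follows; the `ConfigMap` type, the `tryGetFlag` helper, and the "0"/"1" value encoding are assumptions for illustration and are not taken from the restore code.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> ConfigMap;

// Reads a boolean flag that older writers may never have set.
// Returns false when the key is absent and leaves `out` untouched,
// so the caller decides what a missing key means.
bool tryGetFlag(const ConfigMap& config, const std::string& key, bool& out) {
    ConfigMap::const_iterator it = config.find(key);
    if (it == config.end()) {
        return false; // key not written, e.g. by an old server version
    }
    assert(it->second == "0" || it->second == "1"); // assumed encoding
    out = (it->second == "1");
    return true;
}
```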
Vaidas Gasiunas
c6adb3a98c
Building fdb_c_shim to a shared library (#8586) 2022-10-27 12:37:20 +02:00
Markus Pilman
2bf9c2f448
Merge pull request #8588 from sfc-gh-mpilman/bugfixes/fix-build-dependencies
Fix AWS SDK build and removed check for old build system
2022-10-26 12:36:08 -06:00
Dennis Zhou
deeedfc3f8
Merge pull request #8537 from sfc-gh-dzhou/unblob
blob: allow purge ranges to begin and end in unblobbified regions
2022-10-26 11:11:09 -07:00
Markus Pilman
989731f7f4 Fix AWS SDK build and removed check for old build system 2022-10-26 11:48:10 -06:00
Aaron Molitor
f620f391f5 make same change to Dockerfile.eks (from #8583) 2022-10-26 12:24:37 -05:00
Josh Slocum
623e6ef761
adding delay in bw forced shutdown to prevent crash races (#8552) 2022-10-26 12:22:41 -05:00
Nim Wijetunga
6f37f55917
Restore System Keys First in Backup/Restore Workloads (#8475)
* system key restore ordering

* restore system keys before regular data

* atomic restore backup fix

* change testing

* fix compile error

* fix compile issue

* fix compile issues

* Trigger Build

* only split restore if encryption is enabled

* revert knob changes

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/BackupCorrectness.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicRestore.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* add todo

* strengthen check

* separate system restore for atomic restore

* address pr comments

* address pr comments

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-10-26 09:38:27 -07:00
Josh Slocum
ab6953be7d
Blob Granule read-driven compaction (#8572) 2022-10-26 09:02:50 -07:00
Aaron Molitor
b8b7b46d8f update kubectl and awscli 2022-10-26 10:52:05 -05:00
Marian Dvorsky
3c5d3f7a94
Fix SpanContext for GP:getLiveCommittedVersion (#8565)
* Fix SpanContext for GP:getLiveCommittedVersion
2022-10-26 16:29:28 +02:00
Junhyun Shim
32099bfce5
Merge pull request #8564 from sfc-gh-jshim/enable-authz-benchmark-in-mako
Enable authz/TLS-enabled benchmark in mako
2022-10-26 14:55:53 +02:00