Found by simulation:
seed: -f tests/slow/ApiCorrectnessAtomicRestore.toml -s 177856328 -b on
Commit: 51ad8428e0fbe1d82bc76cf42b1579f51ecf2773
Compiler: clang++
Env: Rhel9 okteto
applyMutations() had processed versions 801400000-803141392 and then called sendCommitTransactionRequest(),
which was going to advance the apply begin version to 803141392, but it DID NOT wait for the transaction to commit.
An update to the apply end version (845345760) then picked up the PREVIOUS apply begin version, 801400000,
and started another applyMutations() with the version range 801400000-845345760. Because the previous
applyMutations() had finished without waiting for its transaction commit, the starting version
was wrong. As a result, this applyMutations() re-processed the version range 801400000-803141392.
The test failed during the re-processing, because mutations are missing for the overlapped range.
The fix is to wait for the transaction to commit in sendCommitTransactionRequest().
This bug probably affects DR as well.
See rdar://146877552
20250317-162835-jzhou-ff4c4d6d7c51bfed
Tighten up options for bulk*. Compound 'local' and 'blobstore' as 'dump'/'load'. Ditto for 'history'.
Make it so 'bulkload mode' works like 'bulkdump mode': i.e. dumps current mode.
If mode is not on for bulk*, ERROR in same manner as for writemode.
Make it so we can return bulk* subcommand-specific help rather than dumping all help when there is an issue.
Make the commands match in the ctest
* Hash file before uploading. Add it as a tag after a successful
multipart upload. On download, once the file is on disk,
get its hash and compare it to that of the tag we get from s3.
* fdbclient/CMakeLists.txt
Be explicit what s3client needs.
* fdbclient/S3BlobStore.actor.cpp
* fdbclient/include/fdbclient/S3BlobStore.h
Add putObjectTags and getObjectTags
* fdbclient/S3Client.actor.cpp
Add calculating checksum, adding it as
tags on upload, fetching on download,
and verifying match if present.
Clean up includes.
Less logging.
* fdbclient/tests/s3client_test.sh
Less logging.
* Make failed checksum check an error (and mark non-retryable)
---------
Co-authored-by: michael stack <stack@duboce.com>
* fdbcli/BulkDumpCommand.actor.cpp
* fdbcli/BulkLoadCommand.actor.cpp
Print out the bulkdump description rather than usage so the user
has a chance of figuring out what it is they entered incorrectly.
Make bulkdump and bulkload align by using 'cancel' instead of
'clear' in both and ordering the sub-commands the same for
bulkload and bulkdump. Add more help to the description.
Bulkload help was missing mention of the jobid needed
when specifying a bulkload.
* documentation/sphinx/source/bulkdump.rst
s/clearBulkDumpJob/cancelBulkDumpJob/
Co-authored-by: stack <stack@duboce.com>
* * fdbclient/S3Client.actor.cpp
Change field names so they are capitalized (convention)
Add duration as field to traces.
* fdbserver/BulkLoadUtil.actor.cpp
When the job-manifest is big, processing blocks
for so long that getBulkLoadJobFileManifestEntryFromJobManifestFile
fails.
* Make bulkload file reads and writes async and memory parsimonious.
In tests at scale, processing a large job-manifest.txt was blocking
and causing the bulk job to fail. This is part 1 of two patches.
The second is to address the data copy added below when we
made methods ACTORs (ACTOR doesn't allow passing by reference).
* fdbserver/BulkDumpUtil.actor.cpp
Removed writeStringToFile and buldDumpFileCopy in favor of new methods
in BulkLoadUtil. Made hosting functions ACTORs so could wait on
async calls.
* fdbserver/BulkLoadUtil.actor.cpp
Added async read and write functions.
* fdbserver/DataDistribution.actor.cpp
Making uploadBulkDumpJobManifestFile async made it so big bulkloads
work.
* fix memory corruption in writeBulkFileBytes and fix read options in getBulkLoadJobFileManifestEntryFromJobManifestFile
* If read or write < 1MB, do it in a single read else do multiple read/writes
* * packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go
Just be blunt and write out the credentials. Trying to figure out when the
blob credentials have expired is error prone.
Co-authored-by: michael stack <stack@duboce.com>
Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
* packaging/docker/Dockerfile
Add fdb-aws-s3-credentials-fetcher-sidecar container.
Runs perpetual script that writes blob-credentials.json to /var/fdb.
* packaging/docker/build-images.sh
Build and publish new sidecar container
* packaging/docker/fdb-aws-s3-credentials-fetcher/README.md
* packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go
* packaging/docker/fdb-aws-s3-credentials-fetcher/go.mod
* packaging/docker/fdb-aws-s3-credentials-fetcher/go.sum
Script that fetches credentials via IRSA (IAM Roles for Service Accounts).
* packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go
Match the key generated by fdbserver internally.
* fdbclient/S3BlobStore.actor.cpp
Add some logging around fail-to-find-credentials -- why.
* * fdbclient/tests/aws_fixture.sh
Use the fdb-aws-s3-credentials-fetcher script to fetch credentials, if available, in ctests.
* fdbclient/tests/s3client_test.sh
TMPDIR might not be defined when we print usage.
Co-authored-by: Johannes Scheuermann <johscheuer@users.noreply.github.com>
* Add multiparting to s3client.
Fix boost::urls::parse_uri's dislike of credentialed blobstore urls.
* fdbclient/BulkLoading.cpp
Add blobstore regex to extract credentials before feeding the boost
parse_uri.
* fdbclient/include/fdbclient/S3BlobStore.h
* fdbclient/S3BlobStore.actor.cpp
Add cleanup of failed multipart uploads -- abortMultiPartUpload (s3 will do
this in the background eventually but let's clean up after ourselves).
Also add getObjectRangeMD5 so we can do multipart checksumming.
* fdbclient/S3Client.actor.cpp
Change upload file and download file to always do multipart.
Retry too.
* fdbclient/S3Client_cli.actor.cpp
Add command line to trace rather than output.
* Address Zhe review
* More logging around part upload and download
* Undo assert that proved incorrect; restore the old length math
doing copy in readObject.
Cleanup around TraceEvents in HTTP.actor.
* Undo commented out cleanup -- for debugging
* formatting
---------
Co-authored-by: stack <stack@duboce.com>
* improve bulkload code
* address CI
* disable audit storage replica check and distributed consistency check in bulkload and bulkdump simulation test
* fix ci
* disable waitForQuiescence in bulkload and bulkdump tests
* bulkload support general engine and fix bugs
* add comments
* improve test coverage and fix bug
* nits and address comments
* nit
* nits
* fix data inconsistency bug due to bulkload metadata
* fix ss bulkload task metadata bugs
* nit and fix CI issue
* fix bugs of restore ss bulkload metadata
* use ssBulkLoadMetadata for fetchKey and general kv engine
* cleanup bulkload file for fetchkey
* fix CI issue
* fix simulation stuck due to repeated re-recruitment of unfit dd
* randomly do available space check when finding the dest team for bulkload in simulation
* address conflict
* code clean up
* update BulkDumping.toml same to BulkLoading.toml
* consolidate handling of ss fetchkey and fetchshard failing to read bulkload task metadata
* fix DD bulkload job busy-loop bug which causes a segfault and unexpected test termination in joshua tests
* nit
* fix ss busy loop for bulkload in fetchkey
* use sqlite for bulkload ctest
* fix bulkload ctest stuck issue due to merge and change storage engine to ssd
* fix comments for CC recruit DD
* address comments
* address comments
* add comments
* fix ci format issue
* address comments
* add comments
* New restore consolidated commit
This change adds RestoreDispatchPartitionedTaskFunc to restore
from partitioned-format backup.
* ArenaBlock::totalSize parameter pass by ref
* Fix format issues identified by CI
It was disabled until we made it so the SS could
talk to s3, included in this PR.
Also finished the bulkload test. It only had the
bulkdump portion; bulkload support was recently
added, so finish off the test here by adding a bulkload
of the bulkdump and then verifying all data is present.
Added passing knobs to the fdb cluster so they are available to the
fdbserver when it goes to talk to s3. Also added passing the
SS count to start in the fdb cluster.
* fdbclient/tests/fdb_cluster_fixture.sh
Add ability to pass multiple knobs to fdb cluster
and to specify more than just one SS.
* fdbserver/fdbserver.actor.cpp
Add --blob-server option and processing of FDB_BLOB_CREDENTIALS
if present (hijacked the unused, unadvertised
--blob-credentials-file).
* tests/loopback_cluster/run_custom_cluster.sh
Allow passing more than just one knob.
* fdbclient/BulkLoading.cpp
* fdbclient/include/fdbclient/BulkLoading.h
Added getPath
* fdbclient/S3BlobStore.actor.cpp
Fix bug where we were doubling up the first '/' on a path if
it had a root '/' already (s3 treats /a/b as distinct from
/a//b).
* fdbclient/S3Client.actor.cpp
Fix up of traceevent Types.
* fdbclient/tests/bulkload_test.sh
Enable being able to use s3 if available.
Pick up jobid when bulkdumping. Feed it to new bulkload
method. Add verification all data present post-bulkload.
* fdbserver/BulkLoadUtil.actor.cpp
Add support for blobstore.
* tests/loopback_cluster/run_custom_cluster.sh
Bug fix -- we were only able to pass in one knob. Allow
passing multiple.