7576 Commits

Author SHA1 Message Date
Zhe Wang
5cce92dcac
Simplify BulkLoad Job Metadata (#11959)
* address comments in the PR 11952

* code refactor and simplification

* avoid task outdated in DDBulkLoadJobExecute

* nit

* fix CI issue
2025-02-25 10:57:22 -08:00
Yao Xiao
67b9b5c9f3
Remove per thread histogram in storage engine and fix bugs in range scan. (#11967) 2025-02-25 10:52:46 -08:00
Zhe Wang
94faec13d5
Enable BulkLoad Job to Give Up Unretrievable Task and Fix DDStuck Bug (#11952)
* enable bulkload job to give up unretriable task

* fix ddstuck bug
2025-02-17 17:27:32 -08:00
Jingyu Zhou
1bd6f0aeab Save NOOP progress of backup workers
This is needed so that CC knows the lower bound of versions that can be included
in a backup.
2025-02-17 09:50:29 -08:00
Zhe Wang
d141eea3e1
Allow BulkLoadEngine to Handle Non-Retriable Task (#11950)
* enable-bulkload-engine-accept-unretriable-task

* nit and fmt

* fix bug
2025-02-14 10:52:29 -08:00
Zhe Wang
e070698ed0
DataMove Should Decide BulkLoading After Old DataMove Actor Has Been Cleared (#11947)
* fix bulkload bug

* fix CI
2025-02-13 15:35:55 -08:00
Jingyu Zhou
6a9898de44
Merge pull request #11904 from flowguru/backup1
Refactor backup mutation serialization
2025-02-13 14:18:04 -08:00
Michael Stack
ff22876247
Add multiparting to s3client. (#11920)
* Add multiparting to s3client.
Fix boost::urls::parse_uri's dislike of credentialed blobstore urls.

* fdbclient/BulkLoading.cpp
 Add blobstore regex to extract credentials before feeding the boost
 parse_uri.

* fdbclient/include/fdbclient/S3BlobStore.h
* fdbclient/S3BlobStore.actor.cpp
 Add cleanup of failed multipart -- abortMultiPartUpload (s3 will do
 this in the background eventually but let's clean up after ourselves).
 Also add getObjectRangeMD5 so we can do multipart checksumming.

* fdbclient/S3Client.actor.cpp
 Change upload file and download file to do multipart always.
 Retry too.

* fdbclient/S3Client_cli.actor.cpp
 Add command line to trace rather than output.

* Address Zhe review

* More logging around part upload and download

* Undo assert that proved incorrect; restore the old length math
doing copy in readObject.

Cleanup around TraceEvents in HTTP.actor.

* Undo commented out cleanup -- for debugging

* formatting

---------

Co-authored-by: stack <stack@duboce.com>
2025-02-13 09:06:17 -08:00
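Note on the fdbclient/BulkLoading.cpp item in ff22876247: the message says credentials are pulled out with a regex before the URL is handed to boost's parse_uri. A minimal sketch of that idea follows, using std::regex only. The function name splitCredentials, the pattern, and the sample URL are assumptions made for illustration; they are not taken from the commit.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {credentials, url-without-credentials}.
// If the URL carries no "user-info@" part, credentials is empty and the
// URL is returned unchanged.
std::pair<std::string, std::string> splitCredentials(const std::string& url) {
    // scheme://  credentials  @  rest
    static const std::regex re(R"(^([a-z]+://)([^@/]+)@(.*)$)");
    std::smatch m;
    if (std::regex_match(url, m, re)) {
        assert(m.size() == 4); // whole match plus three capture groups
        return { m[2].str(), m[1].str() + m[3].str() };
    }
    return { "", url };
}
```

The stripped URL can then go to a strict URI parser, and the credentials can be handled separately.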
neethuhaneesha
62cc2a3edf
Migration to consider wiggling based on perpetualStorageEngine and not on configureStorageEngine (#11917) 2025-02-12 11:25:16 -08:00
Yao Xiao
76d514bf56
Update shared rocksdb knobs. #11936 (#11938) 2025-02-11 15:41:02 -08:00
Jingyu Zhou
8cd90ee7d0 Hold onto temporary variables' memory
Otherwise, StringRef points to freed memory locations.
2025-02-10 14:45:24 -08:00
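Note on 8cd90ee7d0: the hazard described is a non-owning reference that outlives the temporary it points into. The sketch below shows the same pattern with std::string_view standing in for StringRef. The names makeKey, firstComponent, and prefixLength are hypothetical and are not from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper that returns a temporary by value.
std::string makeKey(int id) { return "backup/" + std::to_string(id); }

// Returns a view into `s`; valid only while the storage behind `s` is alive.
std::string_view firstComponent(std::string_view s) {
    return s.substr(0, s.find('/'));
}

// BUG pattern (do not do this): the temporary std::string dies at the end of
// the full expression, so `v` would point at freed memory.
//   std::string_view v = firstComponent(makeKey(7));

// Fixed pattern: bind the temporary to a named owner that outlives the view.
std::size_t prefixLength(int id) {
    std::string owner = makeKey(id);            // owner keeps the bytes alive
    std::string_view v = firstComponent(owner); // v points into owner
    assert(v == "backup");
    return v.size();
}
```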
Zhe Wang
d1efff1511
Improve BulkLoad Implementation (#11929)
* improve bulkload code

* address CI

* disable audit storage replica check and distributed consistency check in bulkload and bulkdump simulation test

* fix ci

* disable waitForQuiescence in bulkload and bulkdump tests
2025-02-06 21:25:49 -08:00
Zhe Wang
0f6fa090ce
Bulkload Engine Support General Storage Engine and Fix BulkLoad Bugs (#11898)
* bulkload support general engine and fix bugs

* add comments

* improve test coverage and fix bug

* nits and address comments

* nit

* nits

* fix data inconsistency bug due to bulkload metadata

* fix ss bulkload task metadata bugs

* nit and fix CI issue

* fix bugs of restore ss bulkload metadata

* use ssBulkLoadMetadata for fetchKey and general kv engine

* cleanup bulkload file for fetchkey

* fix CI issue

* fix simulation stuck due to repeated re-recruitment of unfit dd

* randomly do available space check when finding the dest team for bulkload in simulation

* address conflict

* code clean up

* update BulkDumping.toml same to BulkLoading.toml

* consolidate ss fetchkey and fetchshard failed to read bulkload task metadata

* fix DD bulkload job busy loop bug which causes segfault and test terminate unexpectedly in joshua test

* nit

* fix ss busy loop for bulkload in fetchkey

* use sqlite for bulkload ctest

* fix bulkload ctest stuck issue due to merge and change storage engine to ssd

* fix comments for CC recruit DD

* address comments

* address comments

* add comments

* fix ci format issue

* address comments

* add comments
2025-02-06 12:04:13 -08:00
Zhe Wang
b0ff9187ad
fix shardedrocksdb knob and add ENFORCE_SHARDED_ROCKSDB_SIM_IF_AVALIABLE (#11916) 2025-01-29 23:40:37 -08:00
michael stack
86e45fe31d Allow that FDB_PIDS may not be set 2025-01-27 22:36:47 -08:00
michael stack
ca543ca8a6 Handling for 'line 16: kill: Binary: arguments must be process or job IDs'
On cleanup after tests, don't fail. Also print PIDs for fdb processes
in case there is an issue here.
2025-01-27 16:57:55 -08:00
neethuhaneesha
06cdf2e030
Pause store wiggle if all SS does not have minimum available space. (#11905) 2025-01-24 17:29:23 -08:00
hao fu
93133c83fb address comments 2025-01-23 12:32:27 -08:00
hao fu
933b035729 Refactor backup mutation serialization 2025-01-23 08:52:18 -08:00
flowguru
fe47ce24d3
New restore consolidated commit (#11901)
* New restore consolidated commit

This change adds RestoreDispatchPartitionedTaskFunc to restore
from partitioned-format backup.

* ArenaBlock::totalSize parameter pass by ref

* Fix format issues identified by CI
2025-01-22 14:54:55 -08:00
Jingyu Zhou
b1e43da91e
Merge pull request #11896 from yao-xiao-github/main-cp
Add direct io knob and custom compaction policy.
2025-01-22 12:41:50 -08:00
michael stack
f3c9b66e3b Change how we process array passed to a function -- the bash on test servers seems to behave differently 2025-01-19 15:33:04 -08:00
michael stack
aea37ae90d Use s3 if available when running the bulkload test.
It was disabled until we made it so the SS could
talk to s3, included in this PR.

Also finished the bulkload test. It only had the
bulkdump portion. bulkload support was recently
added so finish off the test here by adding bulkload
of the bulkdump and then verifying all data present.

Added passing knobs to the fdb cluster so available to the
fdbserver when it goes to talk to s3. Also added passing
SS count to start in fdb cluster.

* fdbclient/tests/fdb_cluster_fixture.sh
 Add ability to pass multiple knobs to fdb cluster
 and to specify more than just one SS.

* fdbserver/fdbserver.actor.cpp
 Add --blob-server option and processing of FDB_BLOB_CREDENTIALS
 if present (hijacked the unused, unadvertised
 --blob-credentials-file).

* tests/loopback_cluster/run_custom_cluster.sh
 Allow passing more than just one knob.

* fdbclient/BulkLoading.cpp
* fdbclient/include/fdbclient/BulkLoading.h
 Added getPath

* fdbclient/S3BlobStore.actor.cpp
 Fix bug where we were doubling up the first '/' on a path if
 it had a root '/' already (s3 treats /a/b as distinct from
 /a//b).

* fdbclient/S3Client.actor.cpp
 Fix up of traceevent Types.

* fdbclient/tests/bulkload_test.sh
 Enable being able to use s3 if available.
 Pick up jobid when bulkdumping. Feed it to new bulkload
 method. Add verification all data present post-bulkload.

* fdbserver/BulkLoadUtil.actor.cpp
 Add support for blobstore.

* tests/loopback_cluster/run_custom_cluster.sh
 Bug fix -- we were only able to pass in one knob. Allow
 passing multiple.
2025-01-17 17:29:56 -08:00
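Note on the fdbclient/S3BlobStore.actor.cpp item in aea37ae90d: the fix avoids doubling the leading '/' because s3 treats /a/b and /a//b as different keys. A minimal sketch of a join that adds the separator only when it is missing is below. The function name joinResource and its parameters are assumptions for illustration, not the code in the commit.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build "/<bucket>/<path>" without doubling the slash
// when `path` already starts with one.
std::string joinResource(const std::string& bucket, const std::string& path) {
    assert(!bucket.empty()); // precondition for this sketch
    std::string out = "/" + bucket;
    if (path.empty() || path.front() != '/') {
        out += '/'; // add the separator only when the path lacks a root '/'
    }
    out += path;
    return out;
}
```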
Yao Xiao
786e2a6093 Add custom compaction policy based on number of range deletions in file
* compaction policy

* fix build error
2025-01-17 14:14:14 -08:00
Yao Xiao
9ca82b2fda Add knob for direct IO 2025-01-17 14:11:35 -08:00
neethuhaneesha
f4c3565aff
Rocksdb manual flush code changes (#11849) 2025-01-17 12:44:44 -08:00
Dan Lambright
78d4490acf
Add ENABLE_VERSION_VECTOR_REPLY_RECOVERY switch (#11889)
Co-authored-by: Dan Lambright <hlambright@apple.com>
2025-01-16 15:10:06 -05:00
Michael Stack
739bd1bfb0
Merge pull request #11864 from saintstack/use_s3
Have ctests use s3 if it is available.
2025-01-15 16:17:17 -08:00
michael stack
f7ee5e52c3 * fdbclient/tests/seaweedfs_fixture.sh
The search for 'address in use' was overly specific. Loosen it up.
2025-01-15 11:43:01 -08:00
Zhe Wang
0bce8bd281
Parallelize Fetching BulkLoad Manifest Metadata (#11884) 2025-01-15 09:28:12 -08:00
Zhe Wang
9195f78bec
Bulkload FDBCLI Command (#11886) 2025-01-15 09:27:59 -08:00
Syed Paymaan Raza
654255d520
Extend gray failure recentHealthTriggeredRecoveryTime state to reflect any recovery
* Extend gray failure recentHealthTriggeredRecoveryTime state to reflect any recovery, including non-gray failure triggered ones

* Update knob documentation

* Add log
2025-01-15 09:12:47 -08:00
michael stack
1da11f94eb Add more variety to the random temp name making; we seem to have
been using an old directory left over which caused start of weed
to fail.

* fdbbackup/tests/s3_backup_test.sh
 Remove unused S3_RESOURCE

* fdbclient/tests/aws_fixture.sh
* fdbclient/tests/seaweedfs_fixture.sh
 Mix in process id into tmp dir name.

* fdbclient/tests/bulkload_test.sh
 Add (disabled) code to use s3 if it is available.
2025-01-14 14:49:52 -08:00
michael stack
aef2e6de15 Dump out 1k lines of log instead of 50 so we can hopefully see why the test fails on the test machine 2025-01-14 13:13:15 -08:00
michael stack
4648270906 * fdbbackup/tests/s3_backup_test.sh
Refactor to go against s3 if available.

* fdbclient/tests/aws_fixture.sh
 Add aws_setup utility shared by scripts going against s3.

* fdbclient/tests/bulkload_test.sh
 Comment out verification for now.
 Redo of how we use seaweed (less code).

* fdbclient/tests/fdb_cluster_fixture.sh
 Take knobs when starting backup_agent.

* fdbclient/tests/s3client_test.sh
 Explain the OKTETO_NAMESPACE variable.
 Add logging of whether we are going against s3 or seaweed.
 We don't know certificate and key talking to s3.
 Move common setup code out to aws and weed fixtures.

* fdbclient/tests/seaweedfs_fixture.sh
 Make it so there are fewer methods to call when running seaweed.
2025-01-14 13:13:15 -08:00
michael stack
03789e2f1c * fdbclient/S3BlobStore.actor.cpp
Fix compile fail.

* fdbclient/tests/aws_fixture.sh
* fdbclient/tests/seaweedfs_fixture.sh
* fdbclient/tests/tests_common.sh
 Rename local variables so they don't clash
 w/ variables set by the caller.

* fdbclient/tests/bulkload_test.sh
 Refactoring in preparation for this test to go against s3.
 Currently only works against seaweed.
 (Will do in a follow-on PR. I need to do a bit of work first
 to make this possible).

* fdbclient/tests/s3client_test.sh
 Refactor removing duplicated code.
 Added a test to prove s3 works using old md5 hash; i.e.
 disabled integrity check.
2025-01-14 13:13:15 -08:00
michael stack
4d835c542c Have ctests use s3 if it is available.
Fix object integrity check; original approach doesn't work when
serverside encryption is enabled (aws:kms).

* contrib/SimpleOpt/include/SimpleOpt/SimpleOpt.h
 Address sanitizer was complaining about how SimpleOpt manipulates the
 array of options. While memcpy inside a buffer is 'odd', it seems fine.
 It's old code. Leaving it.

* fdbbackup/tests/s3_backup_test.sh
 Pass in weed_dir rather than rely on fixture global (the latter didn't
 work).

* fdbclient/ClientKnobs.cpp
* fdbclient/include/fdbclient/ClientKnobs.h
* fdbclient/include/fdbclient/S3BlobStore.h
 Add a knob to ask for object integrity check on download from s3.
 BLOBSTORE_ENABLE_OBJECT_INTEGRITY_CHECK replaces BLOBSTORE_ENABLE_ETAG_ON_GET
 which doesn't work when serverside encodes content (found in testing).

* fdbclient/S3BlobStore.actor.cpp
 Implement object integrity check on download. If
 enable_object_integrity_check is set, we use sha256 in place of md5
 as our hash. Removed a redundant 'verify' of md5 check.

* fdbclient/S3Client.actor.cpp
 Remove unhelpful comments.

* fdbclient/S3Client_cli.actor.cpp
 Add support for enable_object_integrity_check. This knob replaces
 enable_etag_on_get which didn't work when aws:kms serverside
 encryption was enabled.
 Add error code on exit when exception.

* fdbclient/include/fdbclient/S3Client.actor.h
 Move an include (address a review comment from previous commit).

* fdbclient/tests/aws_fixture.sh
 Add an aws fixture of utilities that can be shared.

* fdbclient/tests/bulkload_test.sh
 Use imported log_test_result

* fdbclient/tests/s3client_test.sh
 Add using s3 if available; otherwise, do seaweedfs.

* fdbclient/tests/seaweedfs_fixture.sh
 WEED_DIR global doesn't work so have caller pass it in for each method
 instead.
2025-01-14 13:13:15 -08:00
Zhe Wang
cf7c8f41b2
BulkLoad Job Framework and Co-Testing BulkLoad and BulkDump (#11865)
* add bulkload job framework and fix bugs

* add BulkLoadChecksum, fix CI issue

* nits

* nits

* address comments

* mitigate perpetual wiggle to make sure DD can select a valid team to inject data

* fix submitBulkDumpJob and submitBulkLoadJob

* change remoteRoot to jobRoot

* add comments
2025-01-14 11:28:42 -08:00
stack
c08b39ea21 Formatting 2025-01-13 09:21:54 -08:00
stack
373d1937e4 Address review comments 2025-01-10 09:53:51 -08:00
michael stack
fd239fcb2d Clarifying documentation on blob backup URL and credentials file.
* documentation/sphinx/source/backups.rst
 Minor edit. Add more examples making it clearer how to do S3
 backup URLs in particular. Explain the 'trick' for omitting
 key, secret, and token from URL instead picking them up from
 the credentials file.

* fdbclient/S3Client_cli.actor.cpp
 Minor cleanup of usage.
2025-01-10 09:30:24 -08:00
Yao Xiao
6811f42735
Add rocksdb version to status json. (#11868)
* Add rocksdb version to status json.

* update schema
2025-01-09 12:56:09 -08:00
Zhe Wang
2bcc27e4d1
fix uninitialized value in BulkLoadManifest (#11869) 2025-01-08 20:23:12 -08:00
michael stack
4c1e74105e Add checksum checking of downloads. Add cleanup of test data.
* fdbclient/ClientKnobs.cpp
* fdbclient/include/fdbclient/ClientKnobs.h
 Add knob BLOBSTORE_ENABLE_ETAG_ON_GET

* fdbclient/S3BlobStore.actor.cpp
 Optionally check etag (md5) volunteered by s3 against the
 content we have downloaded and fail if not equal (TODO:
 check the checksum after we've saved the content to the
 filesystem -- would require a good bit of refactoring).

* fdbclient/S3Client.actor.cpp
 Add deleteResource support.

* fdbclient/S3Client_cli.actor.cpp
 Add COMMAND support; currently either 'cp' or 'rm'.
 Set the knob blobstore_enable_etag_on_get to true by
 default for s3client.

* fdbclient/tests/s3client_test.sh
 Add clean up of resources written up to s3 at end of test.
 (Awkward in bash)
2025-01-06 13:50:19 -08:00
Zhe Wang
d3532e4478
Improve BulkLoad/Dump implementation (#11842)
* Improve BulkLoad/Dump implementation

* make bulkload test data folder inside simfdb folder

* simplify code

* use manifest in bulkdump metadata

* use manifest in bulkload

* apply bulkload fileset to bulkload and fix bugs of bytesampling value generation

* remove BulkDumpFileFullPathSet

* address comments

* address comments

* address comments
2025-01-06 13:02:23 -08:00
Johannes Scheuermann
50ffac87ed
Refactor locality-based exclusion checks to reduce additional overhead (#11838)
* Refactor locality-based exclusion checks to reduce additional overhead

* Update exclusion logic to prevent copies
2024-12-20 20:45:31 +00:00
Jingyu Zhou
d78b4315f8
Fix stack use-after-return bugs (#11846)
Variables before "wait()" are temporary ones that will be destructed in the
actor compiled code. So adding "state" to keep them live while executing the
"wait()" calls.

This is found by ASAN.
2024-12-20 09:11:08 -08:00
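Note on d78b4315f8: Flow actor code is not plain C++, so the sketch below is only an analogy. It models a suspended actor as a closure that runs after the enclosing function has returned. Capturing a local by reference reproduces the use-after-return; capturing by value keeps the data alive, which is roughly what marking a variable "state" does. The name makeContinuation is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for an actor that resumes after a wait(): the
// returned closure runs once makeContinuation's stack frame is gone.
std::function<std::string()> makeContinuation(const std::string& name) {
    assert(!name.empty()); // precondition for this sketch
    std::string local = "range:" + name;

    // BUG pattern (do not do this): `local` lives on this function's stack,
    // so the closure would read a dead slot when it runs later.
    //   return [&local] { return local + ":done"; };

    // Fixed pattern: copy `local` into the closure so it survives the return.
    return [local] { return local + ":done"; };
}
```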
Syed Paymaan Raza
c2862b8728
Lower bound version of CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE (#11840) 2024-12-18 23:33:01 -08:00
Syed Paymaan Raza
122cb96b82
Make sharded rocks deterministic in simulation (phase 1) (#11841) 2024-12-18 16:05:04 -08:00
Zhe Wang
83f42e13d9
Make BulkDump work with S3 (#11822)
* init

* Add bulkdump to blobstore:// (s3)

* cmake/CompileBoost.cmake
 Add boost url. Needed for parsing blobstore:// urls.

* documentation/sphinx/source/bulkdump.rst
 Minor edit to allow addition of blobstore target.

* fdbcli/BulkDumpCommand.actor.cpp
* fdbclient/BulkDumping.cpp
 s/blobstore/s3/ -- more generic and aligns with
 how backup/restore refers to "s3" thingies.

* fdbclient/include/fdbclient/S3Client.actor.h
* fdbclient/S3Client.actor.cpp
 Add batch upload handler.

* fdbclient/tests/seaweedfs_fixture.sh
 Add run seaweed method. Also look for
 weed and if installed use it else download.

* fdbserver/BulkDumpUtil.actor.cpp
 appendToPath does the right thing when passed a URL
 Add bulkDumpTransportBlobstore_impl.
 Add upload to blobstore.

* tests/loopback_cluster/run_custom_cluster.sh
 Complain if unrecognized arguments.

* Add ctest for bulkload with simple bulkdump test for now.

* Add new test to ctest list

* fix bugs

* nit

* nits

* nits

---------

Co-authored-by: stack <stack@duboce.com>
2024-12-18 13:29:36 -08:00