foundationdb

mirror of https://github.com/apple/foundationdb.git synced 2025-06-02 11:15:50 +08:00

Author	SHA1	Message	Date
Jingyu Zhou	deda04b845	Fix a restore bug due to a race (#12037 ) Found by simulation: seed: -f tests/slow/ApiCorrectnessAtomicRestore.toml -s 177856328 -b on Commit: 51ad8428e0fbe1d82bc76cf42b1579f51ecf2773 Compiler: clang++ Env: Rhel9 okteto applyMutations() had processed versions 801400000-803141392 and called sendCommitTransactionRequest(), which was going to update the apply begin version to 803141392, but DID NOT wait for the transaction to commit. Then the apply end version was updated to 845345760, and that update picked up the PREVIOUS apply begin version 801400000, starting another applyMutations() with version range 801400000-845345760. Because the previous applyMutations() had finished without waiting for the transaction to commit, the starting version was wrong, and this applyMutations() re-processed version range 801400000-803141392. The test failed during re-processing because mutations were missing for the overlapped range. The fix is to wait for the transaction to commit in sendCommitTransactionRequest(). This bug probably affects DR as well. See rdar://146877552 20250317-162835-jzhou-ff4c4d6d7c51bfed	2025-03-17 16:12:33 -07:00
Zhe Wang	0e736c68e7	Allow One BulkloadTask Do Multiple Manifests (#12036 )	2025-03-17 11:45:15 -07:00
Zhe Wang	d5946157f0	avoid shard merge when bulkload (#12035 )	2025-03-15 13:20:51 -07:00
Zhe Wang	eb0d9f2028	Add Verbose Level for BulkLoad Trace Events (#12034 ) * add level for DDBulkLoad except for datadistribution * nits	2025-03-14 19:15:41 -07:00
Zhe Wang	6ae46b4917	BulkLoadJob Should Not Schedule Completed BulkLoadTask (#12030 ) * make bulkload job manager logic clear * bypass task if the task has been completed * improve scheduleBulkLoadJob	2025-03-14 14:52:33 -07:00
Michael Stack	74f447cbd9	More cleanup of bulk* cli (#12015 ) Tighten up options for bulk. Compound 'local' and 'blobstore' as 'dump'/'load'. Ditto for 'history'. Make it so 'bulkload mode' works like 'bulkdump mode': i.e. dumps current mode. If mode is not on for bulk, ERROR in same manner as for writemode. Make it so we can return bulk* subcommand specific help rather than dump all help when an issue. Make the commands match in the ctest	2025-03-13 13:49:53 -07:00
Zhe Wang	10fecd0a4e	Add Error Message To BulkLoadJob Metadata (#12024 ) * add error message to bulkload metadata * remove TODOs and add error message for bulkload job manifest map creation failures * nits	2025-03-13 10:02:39 -07:00
Zhe Wang	529db211b2	persist bulkload task count in bulkload job (#12022 )	2025-03-12 15:35:26 -07:00
neethuhaneesha	1d9f16bf07	Added compaction knobs. (#12018 )	2025-03-12 12:38:23 -07:00
Michael Stack	32f2ef9104	Add checksumming across multipart upload and download (#11988 ) * Hash file before uploading. Add it as tag after successful multipart upload. On download, after the file is on disk, get its hash and compare to that of the tag we get from s3. * fdbclient/CMakeLists.txt Be explicit what s3client needs. * fdbclient/S3BlobStore.actor.cpp * fdbclient/include/fdbclient/S3BlobStore.h Add putObjectTags and getObjectTags * fdbclient/S3Client.actor.cpp Add calculating checksum, adding it as tags on upload, fetching on download, and verifying match if present. Clean up includes. Less logging. * fdbclient/tests/s3client_test.sh Less logging. * Make failed checksum check an error (and mark non-retryable) --------- Co-authored-by: michael stack <stack@duboce.com>	2025-03-11 21:34:59 -07:00
Zhe Wang	51ad8428e0	A Couple for Fixes for BulkDump and RangeLock (#12013 ) * fix lockrange test and improve bulk dump * fix bulkdump stuck error * remove unnecessary yield when read/write bulk files * remove unnecessary string creation in read/write bulk files	2025-03-11 15:58:01 -07:00
Michael Stack	6ee6e0bd7f	Edit of bulkload/bulkdump cli. (#12012 ) * fdbcli/BulkDumpCommand.actor.cpp * fdbcli/BulkLoadCommand.actor.cpp Print out the bulkdump description rather than usage so user has a chance of figuring out what it is they entered incorrectly. Make bulkdump and bulkload align by using 'cancel' instead of 'clear' in both and ordering the sub-commands the same for bulkload and bulkdump. Add more help to the description. Bulkload was missing mention of the jobid needed specifying a bulkload. * documentation/sphinx/source/bulkdump.rst s/clearBulkDumpJob/cancelBulkDumpJob/ Co-authored-by: stack <stack@duboce.com>	2025-03-11 08:52:13 -07:00
Syed Paymaan Raza	610ab21936	Increase TLOG_MAX_CREATE_DURATION in simulation	2025-03-10 19:47:23 -07:00
Zhe Wang	79a38c1dc0	Fix RangeLock in BulkDump Test and Avoid Memory Copy For Async Read/Write Bulk Files (#12007 )	2025-03-10 15:13:29 -07:00
Zhe Wang	21b87ef6c8	Improve Range Lock and Add Documentation (#11986 ) * rangelock doc * nits * fix ci * fix ci * nits * address comments * nits * nit * make read lock exclusive * fix * fix CI * improve doc * fix bug * address simulation failures * fix bugs * nits	2025-03-07 14:11:23 -08:00
Michael Stack	e1138c30ee	Make bulkload file reads and writes async and memory parsimonious (#11997 ) * fdbclient/S3Client.actor.cpp Change field names so capitalized (convention) Add duration as field to traces. * fdbserver/BulkLoadUtil.actor.cpp When the job-manifest is big, processing blocks so much getBulkLoadJobFileManifestEntryFromJobManifestFile fails. * Make bulkload file reads and writes async and memory parsimonious. In tests at scale, processing a large job-manifest.txt was blocking and causing the bulk job to fail. This is part 1 of two patches. The second is to address data copy added in the below when we made methods ACTORs (ACTOR doesn't allow passing by reference). * fdbserver/BulkDumpUtil.actor.cpp Removed writeStringToFile and buldDumpFileCopy in favor of new methods in BulkLoadUtil. Made hosting functions ACTORs so could wait on async calls. * fdbserver/BulkLoadUtil.actor.cpp Added async read and write functions. * fdbserver/DataDistribution.actor.cpp Making uploadBulkDumpJobManifestFile async made it so big bulkloads work. * fix memory corruption in writeBulkFileBytes and fix read options in getBulkLoadJobFileManifestEntryFromJobManifestFile * If read or write < 1MB, do it in a single read else do multiple read/writes * packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go Just be blunt and write out the credentials. Trying to figure when the blob credentials have expired is error prone. Co-authored-by: michael stack <stack@duboce.com> Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>	2025-03-06 10:43:04 -08:00
Zhe Wang	8142ebd029	Add BulkLoad History (#11992 ) * add bulkload history * address comments * address comments	2025-03-04 18:50:08 -08:00
neethuhaneesha	5872ef711b	Temporarily disabling backup dry run request until the issue is fixed (#11991 )	2025-03-03 15:52:30 -08:00
Michael Stack	1e1aa71dab	Build a sidecar container that refreshes s3 credentials (#11945 ) * packaging/docker/Dockerfile Add fdb-aws-s3-credentials-fetcher-sidecar container. Runs perpetual script that writes blob-credentials.json to /var/fdb. * packaging/docker/build-images.sh Build and publish new sidecar container * packaging/docker/fdb-aws-s3-credentials-fetcher/README.md * packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go * packaging/docker/fdb-aws-s3-credentials-fetcher/go.mod * packaging/docker/fdb-aws-s3-credentials-fetcher/go.sum Script that fetches credentials via IRSA (IAM Roles for Service Accounts). * packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go Match the key generated by fdbserver internally. * fdbclient/S3BlobStore.actor.cpp Add some logging around fail-to-find-credentials -- why. * * fdbclient/tests/aws_fixture.sh Use the fdb-aws-s3-credentials-fetcher script fetching credentials if available in ctests. * fdbclient/tests/s3client_test.sh TMPDIR might not be defined when we print usage. Co-authored-by: Johannes Scheuermann <johscheuer@users.noreply.github.com>	2025-03-03 08:39:33 -08:00
Jingyu Zhou	add710d7f6	Enable TRACK_TLOG_RECOVERY as default (#11987 ) Test RECORD_RECOVER_AT_IN_CSTATE and TRACK_TLOG_RECOVERY in buggify with random on or off.	2025-03-02 19:15:12 -08:00
Syed Paymaan Raza	6319330d8e	Revert "Update main branch to 8.0 (#11968 )" This reverts commit 710f3f3083b845b0ae5f94b9a2e58eced826f463.	2025-02-28 13:31:40 -08:00
Zhe Wang	8da2a54f4d	Add BulkloadJob Cancellation (#11976 ) * add bulkload cancellation * reduce frequency of job cancellation in tests * fix bulkload assert failure * nits * fix busy loop in bulkload/dump workload * fix workload * but * address comments and CI failures * add task count trace event	2025-02-27 20:34:53 +00:00
Syed Paymaan Raza	710f3f3083	Update main branch to 8.0 (#11968 )	2025-02-26 14:09:52 -08:00
Zhe Wang	2116547ad3	Improve BulkDump Implementation (#11974 ) * bulkdump code refactor * fix bugs * improve	2025-02-26 13:58:45 -08:00
Zhe Wang	5cce92dcac	Simplify BulkLoad Job Metadata (#11959 ) * address comments in the PR 11952 * code refactor and simplification * avoid task outdated in DDBulkLoadJobExecute * nit * fix CI issue	2025-02-25 10:57:22 -08:00
Yao Xiao	67b9b5c9f3	Remove per thread histogram in storage engine and fix bugs in range scan. (#11967 )	2025-02-25 10:52:46 -08:00
Zhe Wang	94faec13d5	Enable BulkLoad Job to Give Up Unretrievable Task and Fix DDStuck Bug (#11952 ) * enable bulkload job to give up unretriable task * fix ddstuck bug	2025-02-17 17:27:32 -08:00
Jingyu Zhou	1bd6f0aeab	Save NOOP progress of backup workers This is needed so that CC knows the lower bound of versions that can be included in a backup.	2025-02-17 09:50:29 -08:00
Zhe Wang	d141eea3e1	Allow BulkLoadEngine to Handle Non-Retriable Task (#11950 ) * enable-bulkload-engine-accept-unretriable-task * nit and fmt * fix bug	2025-02-14 10:52:29 -08:00
Zhe Wang	e070698ed0	DataMove Should Decide BulkLoading After Old DataMove Actor Has Been Cleared (#11947 ) * fix bulkload bug * fix CI	2025-02-13 15:35:55 -08:00
Jingyu Zhou	6a9898de44	Merge pull request #11904 from flowguru/backup1 Refactor backup mutation serialization	2025-02-13 14:18:04 -08:00
Michael Stack	ff22876247	Add multiparting to s3client. (#11920 ) * Add multiparting to s3client. Fix boost::urls::parse_uri's dislike of credentialed blobstore urls. * fdbclient/BulkLoading.cpp Add blobstore regex to extract credentials before feeding the boost parse_uri. * fdbclient/include/fdbclient/S3BlobStore.h * fdbclient/S3BlobStore.actor.cpp Add cleanup of failed multipart -- abortMultiPartUpload (s3 will do this in the background eventually but let's clean up after ourselves). Also add getObjectRangeMD5 so can do multipart checksumming. * fdbclient/S3Client.actor.cpp Change upload file and download file to do multipart always. Retry too. * fdbclient/S3Client_cli.actor.cpp Add command line to trace rather than output. * Address Zhe review * More logging around part upload and download * Undo assert that proved incorrect; restore the old length math doing copy in readObject. Cleanup around TraceEvents in HTTP.actor. * Undo commented out cleanup -- for debugging * formatting --------- Co-authored-by: stack <stack@duboce.com>	2025-02-13 09:06:17 -08:00
neethuhaneesha	62cc2a3edf	Migration to consider wiggling based on perpetualStorageEngine and not on configureStorageEngine (#11917 )	2025-02-12 11:25:16 -08:00
Yao Xiao	76d514bf56	Update shared rocksdb knobs. #11936 (#11938 )	2025-02-11 15:41:02 -08:00
Jingyu Zhou	8cd90ee7d0	Holds onto temporary variables' memories Otherwise, StringRef points to free'ed memory locations.	2025-02-10 14:45:24 -08:00
Zhe Wang	d1efff1511	Improve BulkLoad Implementation (#11929 ) * improve bulkload code * address CI * disable audit storage replica check and distributed consistency check in bulkload and bulkdump simulation test * fix ci * disable waitForQuiescence in bulkload and bulkdump tests	2025-02-06 21:25:49 -08:00
Zhe Wang	0f6fa090ce	Bulkload Engine Support General Storage Engine and Fix BulkLoad Bugs (#11898 ) * bulkload support general engine and fix bugs * add comments * improve test coverage and fix bug * nits and address comments * nit * nits * fix data inconsistency bug due to bulkload metadata * fix ss bulkload task metadata bugs * nit and fix CI issue * fix bugs of restore ss bulkload metadata * use ssBulkLoadMetadata for fetchKey and general kv engine * cleanup bulkload file for fetchkey * fix CI issue * fix simulation stuck due to repeated re-recruitment of unfit dd * randomly do available space check when finding the dest team for bulkload in simulation * address conflict * code clean up * update BulkDumping.toml same to BulkLoading.toml * consolidate ss fetchkey and fetchshard failed to read bulkload task metadata * fix DD bulkload job busy loop bug which causes segfault and test terminate unexpectedly in joshua test * nit * fix ss busy loop for bulkload in fetchkey * use sqlite for bulkload ctest * fix bulkload ctest stuck issue due to merge and change storage engine to ssd * fix comments for CC recruit DD * address comments * address comments * add comments * fix ci format issue * address comments * add comments	2025-02-06 12:04:13 -08:00
Zhe Wang	b0ff9187ad	fix shardedrocksdb knob and add ENFORCE_SHARDED_ROCKSDB_SIM_IF_AVALIABLE (#11916 )	2025-01-29 23:40:37 -08:00
michael stack	86e45fe31d	Allow that FDB_PIDS may not be set	2025-01-27 22:36:47 -08:00
michael stack	ca543ca8a6	Handling for 'line 16: kill: Binary: arguments must be process or job IDs' On cleanup after tests, don't fail. Also print PIDs for fdb processes in case there is an issue here.	2025-01-27 16:57:55 -08:00
neethuhaneesha	06cdf2e030	Pause store wiggle if all SS does not have minimum available space. (#11905 )	2025-01-24 17:29:23 -08:00
hao fu	93133c83fb	address comments	2025-01-23 12:32:27 -08:00
hao fu	933b035729	Refactor backup mutation serialization	2025-01-23 08:52:18 -08:00
flowguru	fe47ce24d3	New restore consolidated commit (#11901 ) * New restore consolidated commit This change adds RestoreDispatchPartitionedTaskFunc to restore from partitioned-format backup. * ArenaBlock::totalSize parameter pass by ref * Fix format issues identified by CI	2025-01-22 14:54:55 -08:00
Jingyu Zhou	b1e43da91e	Merge pull request #11896 from yao-xiao-github/main-cp Add direct io knob and custom compaction policy.	2025-01-22 12:41:50 -08:00
michael stack	f3c9b66e3b	Change how we process array passed to a function -- the bash on test servers seems to behave differently	2025-01-19 15:33:04 -08:00
michael stack	aea37ae90d	Use s3 if available when running the bulkload test. It was disabled until we made it so the SS could talk to s3, included in this PR. Also finished the bulkload test. It only had the bulkdump portion. bulkload support was recently added so finish off the test here by adding bulkload of the bulkdump and then verifying all data present. Added passing knobs to the fdb cluster so available to the fdbserver when it goes to talk to s3. Also added passing SS count to start in fdb cluster. * fdbclient/tests/fdb_cluster_fixture.sh Add ability to pass multiple knobs to fdb cluster and to specify more than just one SS. * fdbserver/fdbserver.actor.cpp Add --blob-server option and processing of FDB_BLOB_CREDENTIALS if present (hijacked the unused, unadvertised --blob-credentials-file). * tests/loopback_cluster/run_custom_cluster.sh Allow passing more than just one knob. * fdbclient/BulkLoading.cpp * fdbclient/include/fdbclient/BulkLoading.h Added getPath * fdbclient/S3BlobStore.actor.cpp Fix bug where we were doubling up the first '/' on a path if it had a root '/' already (s3 treats /a/b as distinct from /a//b). * fdbclient/S3Client.actor.cpp Fix up of traceevent Types. * fdbclient/tests/bulkload_test.sh Enable being able to use s3 if available. Pick up jobid when bulkdumping. Feed it to new bulkload method. Add verification all data present post-bulkload. * fdbserver/BulkLoadUtil.actor.cpp Add support for blobstore. * tests/loopback_cluster/run_custom_cluster.sh Bug fix -- we were only able to pass in one knob. Allow passing multiple.	2025-01-17 17:29:56 -08:00
Yao Xiao	786e2a6093	Add custom compaction policy based on number of range deletions in file * compaction policy * fix build error	2025-01-17 14:14:14 -08:00
Yao Xiao	9ca82b2fda	Add knob for direct IO	2025-01-17 14:11:35 -08:00
neethuhaneesha	f4c3565aff	Rocksdb manual flush code changes (#11849 )	2025-01-17 12:44:44 -08:00


7600 Commits