Found by simulation:
seed: -f tests/slow/ApiCorrectnessAtomicRestore.toml -s 177856328 -b on
Commit: 51ad8428e0fbe1d82bc76cf42b1579f51ecf2773
Compiler: clang++
Env: Rhel9 okteto
applyMutations() had processed versions 801400000-803141392 and called sendCommitTransactionRequest(),
which was going to update the apply begin version to 803141392, but it DID NOT wait for the
transaction to commit.
Then an update to the apply end version to 845345760 picked up the PREVIOUS apply begin version
801400000, and thus started another applyMutations() with version range 801400000-845345760. Because
the previous applyMutations() had finished without waiting for the transaction commit, the starting
version was wrong. As a result, this applyMutations() re-processed version range 801400000-803141392.
The test failed during the re-processing, because mutations are missing for the overlapped range.
The fix is to wait for the transaction to commit in sendCommitTransactionRequest().
This bug probably affects DR as well.
See rdar://146877552
20250317-162835-jzhou-ff4c4d6d7c51bfed
This patch adds TLS support for GrpcServer and AsyncGrpcClient by
implementing `GrpcCredentialsProvider` and using that to get channel
credentials. It adds `FlowGrpc` which is a flow global instance, and
initializes TLS credentials that are consistent with the ones provided
to FlowTransport.
- Added `FlowGrpc` to manage gRPC server initialization and TLS
configuration globally.
- `GrpcCredentialsProvider` abstracts secure/insecure communications
configurations for server/clients.
- Introduced `GrpcTlsCredentialProvider` for dynamic TLS certificate
reloading from filesystem and `GrpcTlsCredentialStaticProvider` for
static in-memory credentials.
- Updated `GrpcServer` to accept a `GrpcCredentialProvider`, enabling
dynamic TLS credential management.
- Modified `fdbserver` to use `FlowGrpc::init()` for gRPC server
initialization instead of `GrpcServer::initInstance()`, aligning it
with FlowTransport behavior.
- Modified `GrpcServer::run()` to use the provided
`GrpcCredentialProvider` instead of hardcoded insecure credentials.
Testing:
- Implemented a basic mTLS test case (`/fdbrpc/grpc/basic_tls`) to
verify secure gRPC connections using
`GrpcTlsCredentialStaticProvider`.
Todo:
- Generate certificates during test runs instead of statically.
- Add test for `GrpcTlsCredentialProvider` which reads keys/certs from
filesystem and monitors changes.
- Verify peer rules/criteria like FDB's --verify-peer feature.
`getFuture()` should be called before `post`, because the `send`/`sendError`
operations in `ThreadReturnPromise` move the underlying `Promise` into
`tagAndForward()`.
Ideally, `ThreadReturnPromise` behavior should stay consistent with
`Promise`. However, it relies on the invariant that there is always exactly
one owner of its internal `Promise` -- either itself or `tagAndForward()` --
which is necessary to ensure that only one thread can operate on the
Promise's internal state (ref count, flags, etc.) and avoid race conditions.
This patch (1) makes sure that in the `post()` case we get the future
first, (2) adds an ASSERT since this should never happen, (3) adds
documentation for future users, and (4) adds a test case for potentially
fixing this in the future.
* ENABLE_VERSION_VECTOR_REPLY_RECOVERY can be true only if ENABLE_VERSION_VECTOR_TLOG_UNICAST is true
* Respond to review comments
---------
Co-authored-by: Dan Lambright <hlambright@apple.com>
Tighten up options for bulk*. Compound 'local' and 'blobstore' as 'dump'/'load'. Ditto for 'history'.
Make it so 'bulkload mode' works like 'bulkdump mode': i.e. it dumps the current mode.
If mode is not on for bulk*, ERROR in the same manner as for writemode.
Make it so we can return bulk* subcommand-specific help rather than dumping all help when there is an issue.
Make the commands match in the ctest.
* - Extend the unicast-based recovery algorithm to do the replication policy check
* - Review comments related changes
* - Review and compilation related changes
* Hash the file before uploading. Add the hash as a tag after a successful
multipart upload. On download, after the file is on disk,
compute its hash and compare it to the tag we get from S3.
* fdbclient/CMakeLists.txt
Be explicit what s3client needs.
* fdbclient/S3BlobStore.actor.cpp
* fdbclient/include/fdbclient/S3BlobStore.h
Add putObjectTags and getObjectTags
* fdbclient/S3Client.actor.cpp
Add calculating checksum, adding it as
tags on upload, fetching on download,
and verifying match if present.
Clean up includes.
Less logging.
* fdbclient/tests/s3client_test.sh
Less logging.
* Make failed checksum check an error (and mark non-retryable)
---------
Co-authored-by: michael stack <stack@duboce.com>
* Track shard moves for version vector
* Don't broadcast to all tlogs when a different commit proxy had a metadata mutation, unless on shard moves
* update lastShardMove on resolver
* Respond to review comments
---------
Co-authored-by: Dan Lambright <hlambright@apple.com>
* fdbcli/BulkDumpCommand.actor.cpp
* fdbcli/BulkLoadCommand.actor.cpp
Print out the bulkdump description rather than the usage so the user
has a chance of figuring out what they entered incorrectly.
Make bulkdump and bulkload align by using 'cancel' instead of
'clear' in both and by ordering the sub-commands the same for
bulkload and bulkdump. Add more help to the description.
Bulkload was missing mention of the jobid needed when
specifying a bulkload.
* documentation/sphinx/source/bulkdump.rst
s/clearBulkDumpJob/cancelBulkDumpJob/
Co-authored-by: stack <stack@duboce.com>
Simulation found an assertion failure in SS:
ASSERT(rollbackVersion >= data->storageVersion());
The reason is that the storage version was updated to a version larger than the
forced recovery version, because max_read_transaction_life_versions is only 1'000'000.
Also added debugging for cumulative checksum mutations.
See rdar://144550725
20250309-185039-jzhou-5145c65b0e8071b7
* * fdbclient/S3Client.actor.cpp
Change field names so they are capitalized (convention).
Add duration as a field to traces.
* fdbserver/BulkLoadUtil.actor.cpp
When the job-manifest is big, processing blocks
for so long that getBulkLoadJobFileManifestEntryFromJobManifestFile
fails.
* Make bulkload file reads and writes async and memory parsimonious.
In tests at scale, processing a large job-manifest.txt was blocking
and causing the bulk job to fail. This is part 1 of two patches.
The second addresses the data copies added below when we
made methods ACTORs (an ACTOR doesn't allow passing by reference).
* fdbserver/BulkDumpUtil.actor.cpp
Removed writeStringToFile and bulkDumpFileCopy in favor of new methods
in BulkLoadUtil. Made the hosting functions ACTORs so they could wait on
async calls.
* fdbserver/BulkLoadUtil.actor.cpp
Added async read and write functions.
* fdbserver/DataDistribution.actor.cpp
Making uploadBulkDumpJobManifestFile async made it so big bulkloads
work.
* fix memory corruption in writeBulkFileBytes and fix read options in getBulkLoadJobFileManifestEntryFromJobManifestFile
* If read or write < 1MB, do it in a single read else do multiple read/writes
* * packaging/docker/fdb-aws-s3-credentials-fetcher/fdb-aws-s3-credentials-fetcher.go
Just be blunt and write out the credentials. Trying to figure out when the
blob credentials have expired is error-prone.
Co-authored-by: michael stack <stack@duboce.com>
Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
The copy constructor can be added back if necessary. Meanwhile, it's simpler to keep the
ThreadReturnPromise* family move-only and avoid scattering copies all over the place.
When a value/error is sent via `ThreadReturnPromiseStream`, we assume that the underlying
`PromiseStream` will be alive when the client waits. However, if the last
`ThreadReturnPromiseStream` is destroyed after sending values/end_of_stream(), the underlying
`PromiseStream` is destroyed as well, resulting in `broken_promise`. This happens because the
actual work of sending the value/error is deferred to the main thread.
This is likely to happen because the sender has done its work and isn't supposed to check whether
the client got the value; hence there is little reason to keep the promise. Meanwhile, the client
is free to read values from its future whenever it needs to.
This patch holds a reference to the underlying `NotifiedQueue` by copying the `PromiseStream` until
the value/error is sent. The added test would fail without this patch.
This patch has two sets of changes:
- Whenever a service is registered with or removed from the server, we need to restart the gRPC
  server. GrpcServer provides some methods that worker actors can use so that the lifetime of the
  services they register is tied to the lifetime of the worker role itself.
- Replace asio::thread_pool with AsyncTaskExecutor in both client and server.
This patch implements `AsyncTaskExecutor` for asynchronous execution of tasks in a separate thread
pool. We already have `IThreadPool`; however, its API is better suited to bigger tasks. This just
provides an easier-to-use API.
There is `AsyncTaskThread`, which is similar in nature, but this component does not re-wrap
IThreadPool and hence can have multiple worker threads. We can potentially replace AsyncTaskThread
with this component by setting `num_threads = 1`.
TODO: Move this to `flow/include` instead of here.
* During commits with version vector enabled, compute the location list only once, as recalculating
could generate a different random number, and hence a different set of locations.
* Respond to review comments.
* Select replicas from locations returned from resolver.
* Respond to review comments
---------
Co-authored-by: Dan Lambright <hlambright@apple.com>