20866 Commits

Author SHA1 Message Date
Zhe Wu
3cb587edfb Remove explicit degraded peer recovery since this may be a false positive 2022-06-23 09:38:27 -07:00
A.J. Beamon
c2752dc773
Merge pull request #7437 from sfc-gh-ajbeamon/copyto-fix
Add check to avoid using memcpy with an invalid data pointer
2022-06-23 08:58:24 -07:00
Vaidas Gasiunas
e28a8401fb
Update coordinator list from cluster file (#7382)
* Log failed connection attempts in monitorProxies

* Update coordinator list from the cluster file after failing to connect to all coordinators

* Wiggle and upgrade test with legacy version monitoring; updating tests to use 7.1.9

* Update coordinator list from the cluster file: addressing review comments

* Update coordinator list from the cluster file: addressing review comments

* Wait on future for all setAndPersistConnectionString calls
2022-06-23 09:22:09 +02:00
A.J. Beamon
485df52b0f Add check to avoid using memcpy with an invalid data pointer 2022-06-22 15:57:47 -07:00
Lukas Joswiak
75423a100c Move shared_ptr to save a reference increment and decrement 2022-06-22 14:50:17 -07:00
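A minimal sketch of the idea behind the commit above, not the actual change: moving a `std::shared_ptr` into its final owner transfers ownership without touching the reference count, whereas copying it would increment the count and later decrement it. `Holder` and `ownersAfterHandOff` are hypothetical names used only for illustration.

```cpp
#include <memory>
#include <utility>

// Hypothetical owner type; stands in for whatever object keeps the pointer.
struct Holder {
    std::shared_ptr<int> state;
    // Take the pointer by value and move it into the member: no extra
    // reference-count increment/decrement beyond what the caller chose.
    explicit Holder(std::shared_ptr<int> s) : state(std::move(s)) {}
};

// Returns the number of owners after handing the pointer off with std::move.
long ownersAfterHandOff() {
    auto state = std::make_shared<int>(42);
    Holder h(std::move(state)); // ownership transferred, count stays at 1
    return h.state.use_count();
}
```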
Lukas Joswiak
4f2b1807e4 Use shared_ptr to track initialization across threads 2022-06-22 14:50:17 -07:00
Lukas Joswiak
1b1a9d4df5 Initialize on main thread 2022-06-22 14:50:17 -07:00
Lukas Joswiak
88557d9169 Simplify function call when transaction is null 2022-06-22 14:50:17 -07:00
Lukas Joswiak
b80ed948f1 Check initialization status before accessing field 2022-06-22 14:50:17 -07:00
Lukas Joswiak
4c2bb0b44e Fix undefined behavior from accessing field of uninitialized object 2022-06-22 14:50:17 -07:00
Johannes Scheuermann
7bff4af14a
Initial support for Prometheus endpoint and pprof for debugging (#7359)
* Initial support for Prometheus endpoint and pprof for debugging
2022-06-22 08:07:48 +01:00
Ata E Husain Bohra
e1ca0ef9a2
Defer recoveredDiskFiles wait if Encryption data at-rest is enabled (#7414)
* Defer recoveredDiskFiles wait if Encryption data at-rest is enabled

Description

In the current code, ClusterController startup waits for the 'recoveredDiskFiles'
future to complete before triggering the 'clusterControllerCore' actor, which
in turn starts the 'EncryptKeyProxy' (EKP) actor responsible for fetching/refreshing
encryption keys needed for ClusterRecovery as well as interactions with
KMS.

Patch addresses a circular dependency where StorageServer initialization
depends on EKP, but CC doesn't recruit EKP until 'recoveredDiskFiles' completes,
which includes SS initialization. Given 'recoveredDiskFiles' is an optimization,
the patch proposes deferring the 'recoveredDiskFiles' future completion until
new Master recruitment is done as part of ClusterRecovery (unblocking the EKP singleton).

Testing

Ran 500K correctness runs: 20220618-055310-ahusain-foundationdb-61c431d467557551
Recorded failures don't seem to be related to the change.
2022-06-21 18:18:57 -07:00
Bharadwaj V.R
8cf2be030f
Build a TenantCache for use by DD (#7207)
* Add a DD tenant-cache-assembly actor
* Add basic tenant list monitoring for tenant cache.
* Update DD tenant cache refresh to be more efficient and unit-testable
* Remove the DD prefix in the tenant cache class name (and associated impl and UT class names); there is nothing specific to DD in it; DD uses it; other modules may use it in the future
* Disable DD tenant awareness by default
2022-06-21 16:29:30 -07:00
Lukas Joswiak
9ca8a3c683 Reenable status json for dynamic knobs, add unit test 2022-06-21 11:43:05 -07:00
Johannes Scheuermann
4b0c4a32b0
Add testing for Kubernetes sidecar (#7105)
* Refactor python sidecar and add unit tests

* Fix issue trying to send error response multiple times

* Fix imports and TLS handling

* Correct config variable in ssl reload
2022-06-21 19:39:53 +01:00
Johannes M. Scheuermann
8da4bb9d07 Add openssl for debugging in container image 2022-06-21 12:14:06 -05:00
Dan Lambright
c48d569024
fix a fault injection bug in txn store recovery (#7405)
* fix a fault injection bug in txn store recovery

* Update LogSystemDiskQueueAdapter.actor.cpp

typo

* recoverLoc can be overwritten, so on reset use the stored range start
2022-06-21 12:33:58 -04:00
Josh Slocum
34e6a8f942
Merge pull request #7399 from sfc-gh-jslocum/bg_tenant_improvements
Bg tenant improvements
2022-06-17 11:19:41 -05:00
Markus Pilman
5aacaf891c
Merge pull request #7321 from sfc-gh-ajbeamon/multiple-tenant-creation
Support creating multiple tenants in the same transaction
2022-06-17 10:10:09 -06:00
Josh Slocum
1cc466e068 fixes and review comments 2022-06-17 08:17:44 -05:00
Trevor Clinkenbeard
b7e4d5440d
Merge pull request #7369 from sfc-gh-tclinkenbeard/fix-unused-var
Fix `-Wunused-but-set-variable` warning in `DDSketchBase::percentile`
2022-06-16 23:42:11 -07:00
sfc-gh-tclinkenbeard
111e28d0ea Merge remote-tracking branch 'origin/main' into fix-unused-var 2022-06-16 17:20:18 -07:00
Xiaoxi Wang
6bb4e341f9
Merge pull request #7110 from sfc-gh-xwang/features/ppw-pause-state
Adding paused/running wiggling status to status json and also the last running/paused timestamp
2022-06-16 14:27:18 -07:00
Xiaoxi Wang
a311cc28cc solve some comments 2022-06-16 11:07:21 -07:00
Josh Slocum
b3597ef3a8 Added plumbing for tenant-aware purge granules 2022-06-16 13:04:34 -05:00
Jon Fu
b891b424ea
Merge pull request #7387 from sfc-gh-jfu/jfu-mako-tenant-rows
When provided tenants in mako, divide number of rows by number of tenants.
2022-06-16 12:55:44 -04:00
Andrew Noyes
83aceb216c
Use absl::GetStackTrace for slow task profiler (#7374)
* Make SlowTask workload runnable in joshua

* Remove SignalSafeUnwind, and use absl::GetStackTrace for slow task profiler
2022-06-15 14:53:52 -07:00
Andrew Noyes
0fea3fb731
Save a bunch of copies in the trace thread (#7392)
Currently, a std::string is copied unnecessarily for every key and value
in a trace event.

This actually showed up in a jemalloc heap profile while I was
investigating something unrelated. I was surprised to see it since these
allocations should have a very short lifetime.
2022-06-15 12:29:15 -07:00
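The kind of copy this commit describes can be sketched as follows. This is an illustration under assumptions, not the trace code itself: `Field` and `addField` are hypothetical names, and the point is only that moving a `std::string` into its destination avoids a second allocation and copy.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical key/value pair, standing in for one field of a trace event.
using Field = std::pair<std::string, std::string>;

// Takes the strings by value so the caller can move them in, then moves them
// again into the container. No string contents are copied along the way when
// the caller passes rvalues.
void addField(std::vector<Field>& fields, std::string key, std::string value) {
    fields.emplace_back(std::move(key), std::move(value));
}
```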
Jon Fu
a96928be2d Merge branch 'main' of github.com:apple/foundationdb into jfu-mako-tenant-rows 2022-06-15 12:15:23 -07:00
Johannes Scheuermann
c9b4ff3302
Add support for lumberjack in logging and update example to 7.1 (#7357) 2022-06-15 19:41:01 +01:00
Jingyu Zhou
a32127a0b9
Merge pull request #7391 from sbodagala/main
Do not always try to figure out the sequencer locality
2022-06-15 13:50:55 -04:00
Ata E Husain Bohra
9396b691b7
Generate GNU compatible build-id for mockkms golang binary (#7389)
* Generate GNU compatible build-id for mockkms golang binary

Description

 diff-1: Fix compilation issue

Generate GNU compatible build-id for mockkms golang binary
Leverage "cgo" to generate build-id

Testing

Debian package build, verified the GNU build-id
2022-06-15 10:43:46 -07:00
Sreenath Bodagala
2c85bb71c1 - Do not try to figure out the sequencer locality if knob
ENABLE_VERSION_VECTOR_HA_OPTIMIZATION is disabled.
2022-06-15 16:08:31 +00:00
Ata E Husain Bohra
8808d93813
Fix bugs in EncryptKeyProxy actor (#7388)
Description

Major changes include:
1. GetEncryptByKeyIds cache elements can expire.
2. Update iterator after erasing an element during refresh encryption keys
   operation.

Testing

EncryptKeyProxyTest
2022-06-14 21:22:25 -07:00
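Item 2 in the commit above refers to a common C++ pitfall: erasing from a container invalidates the erased iterator, so the loop must continue from the iterator that `erase` returns. A minimal sketch, assuming a hypothetical cache layout (`CachedKey`, `eraseExpired`, and the `expiresAt` field are illustrative, not the EncryptKeyProxy types):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical cache entry with an expiry timestamp.
struct CachedKey {
    std::string material;
    int64_t expiresAt;
};

// Removes every entry whose expiry time is at or before `now` and returns
// how many entries were removed.
std::size_t eraseExpired(std::unordered_map<uint64_t, CachedKey>& cache, int64_t now) {
    std::size_t removed = 0;
    for (auto it = cache.begin(); it != cache.end();) {
        if (it->second.expiresAt <= now) {
            // erase() returns the next valid iterator; reusing the old one
            // would be undefined behavior.
            it = cache.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```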
Jon Fu
06c8f9068e fix arg parse ordering 2022-06-14 16:45:58 -07:00
Jon Fu
184c266bdf adjust rows calculation after parsing all arguments but before validation 2022-06-14 16:34:39 -07:00
Jon Fu
6999ebc86f When provided tenants, divide number of rows by number of tenants. Adjust population and range reads to account for this scenario 2022-06-14 16:22:07 -07:00
Yao Xiao
7da26db342
[ShardedRocksDB] 4/N Support removeRange. (#7345) 2022-06-14 13:52:03 -07:00
Yi Wu
6246664006
Support encrypting TxnStateStore (#7253)
Adding encryption support for TxnStateStore. It is done by supporting encryption for KeyValueStoreMemory. The encryption is currently done at the operation level, when the operations are being written to the underlying log file. See inline comment for the encrypted data format.

This PR depends on #7252. It is part of the effort to support TLog encryption #6942.
2022-06-14 13:26:32 -07:00
Xiaoge Su
21ee76a44d fixup! Reformat source #2 2022-06-14 13:22:18 -07:00
Xiaoge Su
c2676df2f8 fixup! Reformat source 2022-06-14 13:22:18 -07:00
Xiaoge Su
9fb6e5bb05 fixup! Fix the clang error when using std::move
This patch is to fix the compile error

/root/src/fdbclient/S3BlobStore.actor.cpp:410:9: error: moving a local
object in a return statement prevents copy elision
[-Werror,-Wpessimizing-move]
 return std::move(resource);
        ^
/root/src/fdbclient/S3BlobStore.actor.cpp:410:9: note: remove std::move
call here
 return std::move(resource);
        ^~~~~~~~~~        ~
1 error generated.
2022-06-14 13:22:18 -07:00
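The clang diagnostic quoted above can be illustrated with a small sketch. The `Resource` type and `makeResource` function are hypothetical; the type of `resource` at S3BlobStore.actor.cpp:410 is not shown in the log. Returning a local by name lets the compiler elide the copy (or fall back to an implicit move), while wrapping it in `std::move` blocks elision and triggers `-Wpessimizing-move`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the object being returned.
struct Resource {
    std::string path;
    std::vector<std::string> params;
};

Resource makeResource(const std::string& bucket, const std::string& object) {
    Resource resource;
    resource.path = "/" + bucket + "/" + object;
    resource.params.push_back("max-keys=1000");
    // Return the local by name. Writing `return std::move(resource);` here
    // would prevent copy elision, which is what clang warns about.
    return resource;
}
```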
Vishesh Yadav
fd6f6eb06a
Merge pull request #7364 from sfc-gh-ljoswiak/fixes/unnecessary-transaction-initialization
Remove unnecessary ReadYourWritesTransaction initialization
2022-06-14 11:02:31 -07:00
Xiaoge Su
00b805d8e0 fixup! Reformat source 2022-06-14 10:43:13 -07:00
Xiaoge Su
e493f1c3cd fixup! Add a retry mechanism in changeQuorumChecker and changeQuorum
This is to fix an issue when recovery and a coordinator key change happen
together. The issue will occur when:

1. Recovery starts
2. Coordinator key change transaction started
3. During the recovery the coordinator key is read from cluster file and
   stored in the storage server
4. The cluster controller received `ChangeCoordinatorsRequest`, and
   updated the cluster name with the new value.

At this stage, the value related to the coordinator key in the storage server and
the worker is inconsistent.

5. changeQuorumChecker is called, which will verify such consistency.
   Since they are different, the call is returning failure and the
   caller, which could be a TEST_CASE, fails.

This is a rare race issue, and it is also noticed that when the
recovery/coordinator key change process is done, the database is in a
proper state which allows changeQuorumChecker to behave properly. In this
case, a retry mechanism should be sufficient to fix the corresponding test
failures.
2022-06-14 10:43:13 -07:00
Junhyun Shim
ed91ab5d54
Work around flow trace's data race bug (#7237)
* Work around flow trace's data race bug

BaseTraceEvent::setNetworkThread() and flushTraceFile[()|Void()]
have a long-standing race condition on the traceEventThrottlerCache global
when flushTraceFileVoid() is not called from the network thread.

This race dates back to 2017 (commit hash 80e5fecfe2),
so before the race itself is fixed, work around the problem.

* Remove call to flushTraceFileVoid() from MkCertCli

* Apply clang format
2022-06-14 12:09:34 +02:00
Yao Xiao
ddbecb69ad
Ignore ShardedRocksDBTest. (#7381) 2022-06-13 23:13:11 -07:00
Renxuan Wang
839af5701e
Fix bug in resolveTCPEndpoint() when hostname resolving fails. (#7375)
* Close trace file when error happens in runNetwork().

* Improve the bestCount algorithm in getLeader().

In the current implementation, if the nominees are [0,1], the chosen leader will be 1, which is inconsistent with other cases and with our expectation that if two nominees have the same frequency, the one with the lower id will be the leader.

* Remove unnecessary new statement.

stream will never be a nullptr.

* Move self->dnsCache out of lambda capture.

Member variables are not captured by default; thus, `host` and `service` are not captured. This somehow compiles successfully, but throws std::bad_alloc or basic_string::_S_create exceptions when we call `host+":"+service` in dnsCache.remove().

* Revert unintended change.

* Address comments.
2022-06-13 20:24:30 -07:00
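The tie-breaking rule described for the bestCount change in the commit above (equal frequency, lower id wins) can be sketched as below. This is only an illustration of that rule over plain integer ids; `pickLeader` is a hypothetical function and not the actual getLeader() implementation.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Returns the id that appears most often among the nominees; on a tie,
// the lowest id wins.
int pickLeader(const std::vector<int>& nominees) {
    assert(!nominees.empty());
    std::map<int, int> counts; // ordered by id, ascending
    for (int id : nominees) {
        ++counts[id];
    }
    int best = -1;
    int bestCount = 0;
    for (const auto& entry : counts) {
        // Strict '>' while iterating in ascending id order means a later
        // (higher) id never replaces an earlier one with the same count.
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}
```

With nominees [0,1] this sketch selects 0, matching the expectation stated in the commit message.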
Hao Fu
9cee4c94e7
Safely remove fdb_transaction_get_range_and_flat_map (#7314) 2022-06-13 19:05:00 -07:00
Trevor Clinkenbeard
6bed046148
Merge pull request #7352 from sfc-gh-xwang/feature/ddtxn
[DD testability enhancement] Create IDDTxnProcessor and simple refactoring
2022-06-13 16:01:13 -07:00