Skip to content

GH-43057: [C++] Thread-safe AesEncryptor / AesDecryptor #44990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 44 commits into from
Apr 2, 2025

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Dec 10, 2024

Rationale for this change

OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDencryptor, which is used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. Some thread may call WipeOut while other threads still use the instance.

What changes are included in this PR?

  • Remove the WipeOut methods and related datastructures entirely.
  • Each call into CtrEncrypt / CtrDecrypt and GcmEncrypt / GcmDecrypt uses its own EVP_CIPHER_CTX instance, making this thread-safe.

After fixing this "AesDecryptor was wiped out" issue, two other segmentation faults surfaced: GH-44988. This has also been addressed here as it can only be exposed after fixing the wipe-out issue.

Fixes GH-43057.
Fixes GH-44852.
Fixes GH-44988.

Are these changes tested?

A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 11, 2024
@adamreeve
Copy link
Contributor

The test failures look like they might be caused by openssl/openssl#21955

The failing ASAN test run is using OpenSSL 3.02, but builds with other versions are passing. Eg. the Ubuntu 20.04 build uses OpenSSL 1.1.1f and the Conda build has OpenSSL 3.3.2.

The fix for this was backported to the 3.0 branch and released in 3.0.13 (openssl/openssl#23102), but Ubuntu 22.04 still uses 3.0.2. If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time?

@pitrou
Copy link
Member

pitrou commented Dec 11, 2024

If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time?

Let's use the "slower" code path everywhere? I'm skeptical it will be that slow...

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for stepping in and submitting this @EnricoMi . You'll find a number of comments below.

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from a2e3992 to 108c8cf Compare December 11, 2024 19:49
@EnricoMi
Copy link
Contributor Author

I have addressed all comments except for the test related ones. Will go over them tomorrow.

Thanks for your input, that is highly appreciated!

@adamreeve
Copy link
Contributor

I tested concurrent scans of a larger dataset with uniform encryption, and this change doesn't completely fix that scenario:

--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -289,6 +289,8 @@ TEST_F(DatasetEncryptionTest, ReadSingleFile) {
 // processing encrypted datasets over 2^15 rows in multi-threaded mode.
 class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
  public:
+  LargeRowEncryptionTest() : DatasetEncryptionTestBase(true) {}
+
   // The dataset is partitioned using a Hive partitioning scheme.
   void PrepareTableAndPartitioning() override {
     // Specifically chosen to be greater than batch size for triggering prefetch.
@@ -307,7 +309,7 @@ class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
 
 // Test for writing and reading encrypted dataset with large row count.
 TEST_F(LargeRowEncryptionTest, ReadEncryptLargeRows) {
-  ASSERT_NO_FATAL_FAILURE(TestScanDataset());
+  ASSERT_NO_FATAL_FAILURE(TestScanDataset(true));
 }
 
 }  // namespace dataset

This gives errors like:

failed with IOError: Failed decryption finalization

I believe this is due to the decryptor AAD being mutable. It's updated for each data page read, so concurrent use will result in incorrect AADs being used in some threads.

Rather than mutating the decryptor state, it might be possible to refactor this so that the AAD values are passed in to the decrypt method as a parameter.

I think this should probably be addressed as a separate PR though as the changes should be orthogonal.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Dec 12, 2024

failed with IOError: Failed decryption finalization

Confirmed in ec9b054. Multi-threaded uniform encryption read fails.

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch 2 times, most recently from 33ff64b to 6508c5a Compare December 12, 2024 09:49
@EnricoMi
Copy link
Contributor Author

failed with IOError: Failed decryption finalization

Confirmed in ec9b054. Multi-threaded uniform encryption read fails.

Attempt to fix multi-threaded AAD update: d77a081

The PageReader uses its own copy of the data_decryptor / data_decryptor. Happy to move into separate PR.

@mapleFU
Copy link
Member

mapleFU commented Jan 21, 2025

Any updates?

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from 4fc5b42 to b652e06 Compare January 21, 2025 08:55
@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from b652e06 to fbe3a1c Compare January 29, 2025 18:25
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EnricoMi Sorry for the delay and thanks a lot for persevering on this. I think the fix is still not 100% correct, see comments below. But we're nearing a solution.

Also, I can help on this PR if you want.

@EnricoMi EnricoMi force-pushed the thread-safe-endecryptor branch from 85c55ba to 4d388fa Compare February 6, 2025 17:17
@pitrou

This comment was marked as outdated.

This comment was marked as outdated.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 1, 2025

@pitrou suggestions applied and rebased

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

@github-actions crossbow submit -g cpp -g python

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

@github-actions crossbow submit cp313

Copy link

github-actions bot commented Apr 2, 2025

Revision: a587553

Submitted crossbow builds: ursacomputing/crossbow @ actions-05dac70ebf

Task Status
wheel-macos-monterey-cp313-cp313-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313-arm64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-arm64 GitHub Actions
wheel-musllinux-1-2-cp313-cp313-amd64 GitHub Actions
wheel-musllinux-1-2-cp313-cp313-arm64 GitHub Actions
wheel-musllinux-1-2-cp313-cp313t-amd64 GitHub Actions
wheel-musllinux-1-2-cp313-cp313t-arm64 GitHub Actions
wheel-windows-cp313-cp313-amd64 GitHub Actions
wheel-windows-cp313-cp313t-amd64 GitHub Actions

Copy link

github-actions bot commented Apr 2, 2025

Revision: a587553

Submitted crossbow builds: ursacomputing/crossbow @ actions-34c96e7a1d

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

@github-actions crossbow submit -g cpp

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot for this @EnricoMi !

Copy link

github-actions bot commented Apr 2, 2025

Revision: 9961069

Submitted crossbow builds: ursacomputing/crossbow @ actions-ebb6807bc5

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 2, 2025

@pitrou thank you for your time! Will this make it into the 20.0.0 release?

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

@EnricoMi It should, I'll wait for CI and merge if green.

@zanmato1984
Copy link
Contributor

I think the release 20.0.0 has been feature-freezed, and the maint-20.0.0 has been branched earlier today?

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 2, 2025

Can we consider this a bug fix and not a feature so it could still make it into the 20.0.0 maint branch? Or is "feature freeze" more a general "code freeze"?

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

This is definitely a bug fix (it fixes an actual bug) even though it's quite elaborate and goes deeper than simply preventing nasty behavior.

@zanmato1984
Copy link
Contributor

Well, I mean that with feature freeze done (== maintenance branch branched out), it doesn't happen naturally for this PR to be in 20.0.0. We may ask for help from release managers. And yes a bug fix still gets a chance to be back-ported (depending on the severity I guess?).

@raulcd
Copy link
Member

raulcd commented Apr 2, 2025

The release manager (@assignUser ) has been notified and is happy to add it to the maintenance branch, as per Zulip conversation.

@pitrou
Copy link
Member

pitrou commented Apr 2, 2025

CI failures are unrelated.

@pitrou pitrou merged commit 6452194 into apache:main Apr 2, 2025
29 of 32 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Apr 2, 2025
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 2, 2025

Awesome, many thanks! Happy to help with the back-porting, please let me know if I can help!

@EnricoMi EnricoMi deleted the thread-safe-endecryptor branch April 2, 2025 10:49
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6452194.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 48 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025
…#44990)

### Rationale for this change

OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDencryptor, which is used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. Some thread may call `WipeOut` while other threads still use the instance.

### What changes are included in this PR?

- Remove the `WipeOut` methods and related datastructures entirely.
- Each call into `CtrEncrypt` / `CtrDecrypt` and `GcmEncrypt` / `GcmDecrypt` uses its own `EVP_CIPHER_CTX` instance, making this thread-safe.

After fixing this `"AesDecryptor was wiped out"` issue, two other segmentation faults surfaced: apacheGH-44988. This has also been addressed here as it can only be exposed after fixing the wipe-out issue.

Fixes apacheGH-43057.
Fixes apacheGH-44852.
Fixes apacheGH-44988.

### Are these changes tested?
A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.

### Are there any user-facing changes?
No.
* GitHub Issue: apache#43057

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <pitrou@free.fr>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025
…#44990)

### Rationale for this change

OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDencryptor, which is used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. Some thread may call `WipeOut` while other threads still use the instance.

### What changes are included in this PR?

- Remove the `WipeOut` methods and related datastructures entirely.
- Each call into `CtrEncrypt` / `CtrDecrypt` and `GcmEncrypt` / `GcmDecrypt` uses its own `EVP_CIPHER_CTX` instance, making this thread-safe.

After fixing this `"AesDecryptor was wiped out"` issue, two other segmentation faults surfaced: apacheGH-44988. This has also been addressed here as it can only be exposed after fixing the wipe-out issue.

Fixes apacheGH-43057.
Fixes apacheGH-44852.
Fixes apacheGH-44988.

### Are these changes tested?
A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.

### Are there any user-facing changes?
No.
* GitHub Issue: apache#43057

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <pitrou@free.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants