Skip to content

Conversation

@wingkitlee0
Copy link
Contributor

@wingkitlee0 wingkitlee0 commented Aug 16, 2025

Rationale for this change

This change fixes the serialization/deserialization (and equality) of S3FileSystem. Previously, the retry_strategy parameter is missed in the serialization. This affects other distributed libraries like Ray to incorrectly reconstruct a S3FileSystem with no retry_strategy parameter in other processes.

What changes are included in this PR?

The changes can be summarized in two parts, one in pyarrow and one in c++ library.

For pyarrow, this PR adds retry_strategy in S3FileSystem.__reduce__ and a property method to allow access to retry_strategy (a python object). Since S3FileSystem is a cdef class but AWSDefaultRetryStrategy etc are not, object type is used for simplicity.

In C++, an Equals method is added to S3RetryStrategy in C++. This allows the equality comparison in both c++ and python to consider retry strategy properly, e.g., comparing S3Options and S3FileSystem with retry strategy taken into account. For now, the comparison returns true when they are the same class and parameters (implemented for AwsDefaultRetryStrategy and AwsStandardRetryStrategy)

Are these changes tested?

[Python] Updated the unit tests to test for retry_strategy.
The new retry_strategy property is used to verify the correct reconstruction.

[C++] New tests are added for AwsRetryStrategy Equals() method.

Are there any user-facing changes?

No.

@github-actions
Copy link

⚠️ GitHub issue #47348 has been automatically assigned in GitHub to PR creator.

@wingkitlee0 wingkitlee0 marked this pull request as ready for review August 16, 2025 22:30
@raulcd
Copy link
Member

raulcd commented Aug 18, 2025

@github-actions crossbow submit -g python

@github-actions
Copy link

Revision: 983021c6c7246f468a2647f34b0064625e08ddbc

Submitted crossbow builds: ursacomputing/crossbow @ actions-19f6d98dfd

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I am unsure about the object but I am not sure what the alternative would look like without spending some time.
@rok @AlenkaF can you take a look?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Aug 19, 2025

fs = S3FileSystem(
retry_strategy=AwsDefaultS3RetryStrategy(max_attempts=5))
assert isinstance(fs, S3FileSystem)
assert pickle_module.loads(pickle_module.dumps(fs)) == fs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but should also assert that it's different from a filesystem configured with another strategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I will look into it. It seems like Cython will call cpp side of things, e.g., Equals of FileSystem. I think I will need to add AwsRetryStrategy::Equals as well.

@wingkitlee0 wingkitlee0 changed the title GH-47348: [Python] Include retry_strategy in S3FileSystem.__reduce__ GH-47348: [C++][Python] Make S3FileSystem pickle-able with retry strategy Sep 5, 2025
@wingkitlee0
Copy link
Contributor Author

@raulcd @pitrou please take a second look. I have added some code in c++ and updated the PR description.

Based on one of the comments, to compare S3FileSystem with different retry strategies, I had to implement it in C++. I tried multiple things but found it's most straight-forward to use std::visit. It should be flexible enough for a new retry strategy to define its equality. I did update a few things in AwsRetryStrategy because of the use std::variant, but it's a private API so I hope it's okay.

C++ is not my day-job language, any comment is appreciated. thanks.

@wingkitlee0
Copy link
Contributor Author

bump

@wingkitlee0
Copy link
Contributor Author

Hi @pitrou @raulcd, may I get a review on this PR? Thanks.

@wingkitlee0
Copy link
Contributor Author

wingkitlee0 commented Dec 16, 2025

Hi @AlenkaF , just wonder if you can give this PR a review or help me find someone to do this? I am an open-source contributor to Ray and this filesystem's retry strategy has been bugging us for a while because they were not properly pickled and sent to remote processes.

Thank you.

…ryStrategy

[C++] Added Equals() to S3RetryStrategy. Now S3Options and S3FileSystem will
compare their retry strategies as well.

[Python] Included retry_strategy in S3FileSystem.__reduce__ method.
S3FileSystem's retry_strategy should survive serialization and deserialization.

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants