[KYUUBI #7245] Fix arrow batch converter error #7246
base: master
Conversation
@echo567, please keep the PR template and fill it in seriously, especially "Was this patch authored or co-authored using generative AI tooling?"; it matters for legal purposes.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##           master   #7246   +/-   ##
=======================================
  Coverage    0.00%   0.00%
=======================================
  Files         696     696
  Lines       43530   43528     -2
  Branches     5883    5881     -2
=======================================
+ Misses      43530   43528     -2
```
Sorry, the changes have been made.
The code is copied from Spark; it seems it was changed in SPARK-44657. Can we just follow that?
Okay, I made modifications based on this Spark issue.
Hi, I've merged the code from the latest master branch. Is there anything else I need to change?
Why are the changes needed?
Control the amount of data per batch to prevent memory overflow and improve the initial fetch speed.
When `kyuubi.operation.result.format=arrow`, `spark.connect.grpc.arrow.maxBatchSize` does not work as expected.

Reproduction:
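For context, these are the two settings involved, shown as a sketch; the actual values used in the reproduction are not included in this excerpt:

```
# Kyuubi: return operation results in Arrow format
kyuubi.operation.result.format=arrow

# Spark: cap on the estimated size of each Arrow batch
spark.connect.grpc.arrow.maxBatchSize=...
```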
You can debug `KyuubiArrowConverters` or add the following log at line 300 of `KyuubiArrowConverters`.

Test data: 1.6 million rows, 30 columns per row. Command executed:
Log output
Original Code
When the `limit` is not set, i.e. `-1`, all data will be retrieved at once. If the row count is too large, the following three problems occur:
(1) Driver/executor OOM
(2) Array OOM, because the maximum array length is exceeded
(3) Slow data transfer
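The fix direction described above (and in SPARK-44657) amounts to flushing a batch once its estimated size reaches the cap, instead of draining the whole iterator when `limit` is `-1`. The following is a hedged, standalone sketch of that idea, not the actual `KyuubiArrowConverters` code; the object and method names are hypothetical, and row sizes are modeled as plain byte counts:

```scala
// Hypothetical sketch of size-limited batching. Real Arrow conversion
// estimates serialized row sizes; here each row is just a byte count.
object BatchSizeSketch {
  // Split rows into batches whose estimated total size stays near maxBatchBytes.
  def splitIntoBatches(rowSizes: Iterator[Long], maxBatchBytes: Long): Seq[Seq[Long]] = {
    val batches = scala.collection.mutable.ArrayBuffer.empty[Seq[Long]]
    val current = scala.collection.mutable.ArrayBuffer.empty[Long]
    var estimated = 0L
    for (size <- rowSizes) {
      current += size
      estimated += size
      // Flush once the estimate reaches the cap. The row that crosses the
      // threshold stays in the batch, so a batch can end up slightly larger
      // than the cap, which matches the log observation below.
      if (estimated >= maxBatchBytes) {
        batches += current.toSeq
        current.clear()
        estimated = 0L
      }
    }
    if (current.nonEmpty) batches += current.toSeq
    batches.toSeq
  }
}
```

With this shape, no batch (and no backing array) has to hold the full 1.6 million rows, which addresses all three problems at once.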
After updating the code, the log output is as follows:
The `estimatedBatchSize` is slightly larger than the `maxEstimatedBatchSize`. Data can be written in batches as expected.
Fix #7245.
How was this patch tested?
Test data: 1.6 million rows, 30 columns per row.
Was this patch authored or co-authored using generative AI tooling?
No