GH-38792: [C++][Gandiva] Optimize cache #48597
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR supersedes #38790 since that has been inactive for a few years.
The original author is @likun666661
Rationale for this change
In Gandiva, it has been observed that the tool caches compiled expressions. However, when using the expression cache, Gandiva takes into account the schema, expr->ToString, and specific parameters. Nevertheless, in llvm_generator.cc, Gandiva still traverses the expressions to generate LLVM IR code and then produces specific JIT functions. Considering that llvm_generator traverses expressions and calculates the memory start positions of fields involved in the computation, we can further enhance Gandiva's caching mechanism.
Specifically, we can exclude the influence of the schema. Simultaneously, we can also exclude the field names' information involved in the expression computation. For example, for a schema {f0:int32, f1:int32} with an expression add(f0, f1), we can cache it initially. Then, for a schema {a0:int32, a1:int32, a2:int32} with an expression add(a1, a2), it should not require recompilation; it can directly reuse the JIT function generated by the expression {f0:int32, f1:int32}, add(f0, f1).
What changes are included in this PR?
This PR primarily alters the construction of the cache key and makes modifications to the schema validation in both the projector and filter. It also adjusts the validation of fields in expr_validator. Additionally, a ToCacheKeyString function has been added to each node. This PR mainly focuses on these areas.
Are these changes tested?
Yes,these changes are tested.
Are there any user-facing changes?
No.
#38792
Closes: #38792