-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Allow extensions to arrow-json decoder and include an extension for variant #9021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| fn make_decoder( | ||
| field: Option<FieldRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich you commented here on the other PR about it being awkward to take Field and still require data_type and is_nullable. I agree but the issue I ran into is that when ReaderBuilder::build_decoder calls make_decoder on the root of the schema it creates a Struct based on the Schema to pass in.
Another option here could be to create a dummy Field from the Schema. I'd have to look at whether data_type and is_nullable are always inherited from the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're not always inherited. Some callers are passing the union of all ancestors' nullability, i.e. when extracting a nested field from a struct. That's why I had suggested perhaps just passing the field metadata here -- because I don't think the nullable and data type members of a passed-in Field would be used, potentially leading to confusion.
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very fun! Sparks lots of ideas, see comments.
Also -- looking at the top-level documentation for arrow-json, we might want to consider some custom decoder support for binary data, to round trip it by encodings other than Base16? That would be way nicer than the currently recommended workaround to manually cast the columns?
| field.clone(), | ||
| data_type.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: We don't know yet that the factory will actually produce a decoder. Should the trait take &Arc so implementations can decide whether to clone them? Or is the overhead so (relatively) low compared to everything else involved with parsing that we don't care?
arrow-json/src/reader/mod.rs
Outdated
| pub trait DecoderFactory: std::fmt::Debug + Send + Sync { | ||
| /// Make a decoder that overrides the default decoder for a specific data type. | ||
| /// This can be used to override how e.g. error in decoding are handled. | ||
| fn make_default_decoder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is IMO a misleading method name:
- The "default" encoder is whatever would have been created if no factory existed.
- A factory is used to override that default.
So it seems like make_decoder or make_custom_decoder would be more descriptive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to make_custom_decoder
| /// assert_eq!(values.value(0), "a"); | ||
| /// assert!(values.is_null(1)); | ||
| /// ``` | ||
| pub trait DecoderFactory: std::fmt::Debug + Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am understanding correctly here, one could almost use this framework to set up a fully customizable decoding. But not quite.
The decoder factory could intercept the top-level struct of the input schema and assign a custom decoder to every field based on path and/or type. That could be really slick for custom error handling, for example turning all incompatible inputs to NULL instead of producing errors (potentially only for a subset of known-funky fields while leaving "reliable" fields fully strongly and strictly typed). Problem is, if the decoder factory intercepts a complex type, it has to provide a custom decoder for every field of that struct (and wrap them all in a custom complex type decoder). Ouch.
I don't know that we would want to make all the internal decoders public -- they are currently pub members of private modules -- but is there perhaps a way to give decoder factories access to some functionality like make_decoder, even if only implicitly? A way to say "do whatever you normally would have done here"?
For primitives, it might be as simple as defining a DefaultDecoderFactory that forwards blindly to make_decoder.
But struct/map/list would need something more, because the decoder factory for a complex type would need to assemble the custom sub-decoders of that type's children (if not defaulted). Maybe the hypothetical DefaultDecoderFactory would have associated make_xxx_decoder functions which are just validating wrappers that create the appropriate decoder directly from user-provided child decoders? e.g. make_list_decoder would take a single Box<dyn ArrayDecoder>, make_map_decoder would take two decoders, and make_struct_decoder would take a vec of field decoders. Alternatively, we could make the complex type decoders pub and define actual try_new. But that would increase the number of types in an asymmetric way (where are the primitive decoders).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im a little worried about this Pr growing in scope. (My main goal was to make it so that variant could be decoded and I've already had to pull in the extension code too). Nevertheless I think it's worth considering more usefulness.
Im hearing two related uses.
First, for changing the behavior of some fields in complex types: I think another way of handling this is to augment the schema in some way with a special decoder kind. This could allow for the default behavior to be used by the decoder (including on structs/lists) but specialized decoders substituted in directly on the (sub)fields as indicated by the schema. Im not sure if this would fit into the existing schema type (eg special metadata) or if a new kind of DecoderSchema would be created to hold this information.
Second for catching errors on primitive types and substituting something else (e.g. NULL) my suggestion is to make that be an optional setting on the built in primitive decoder. Possibly when combined with the solution above to make it so that certain fields could be marked with this behavior in the schema.
| /// This decoder is typically used indirectly via [`VariantArrayDecoderFactory`] when | ||
| /// reading JSON data into Variant columns. | ||
| #[derive(Default)] | ||
| pub struct VariantArrayDecoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized! When combined with the variant shredding struct translation work done elsewhere -- and the full-custom decoding support I mentioned above -- we could define a ShreddedVariantArrayDecoder that turns JSON directly into shredded variant. That would be super slick when parsing mostly-strongly-typed schemas.
| let mut builder = VariantBuilder::new(); | ||
| variant_from_tape_element(&mut builder, *p, tape)?; | ||
| let (metadata, value) = builder.finish(); | ||
| array_builder.append_value(Variant::new(&metadata, &value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VariantArrayBuilder implements VariantBuilderExt API. You should be able to just pass it to variant_from_tape_element. Avoids the intermediate vec allocations and validation this loop currently pays.
| let mut builder = VariantBuilder::new(); | |
| variant_from_tape_element(&mut builder, *p, tape)?; | |
| let (metadata, value) = builder.finish(); | |
| array_builder.append_value(Variant::new(&metadata, &value)); | |
| variant_from_tape_element(&mut array_builder, *p, tape)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| let (metadata, value) = builder.finish(); | ||
| array_builder.append_value(Variant::new(&metadata, &value)); | ||
| } | ||
| let variant_struct_array: StructArray = array_builder.build().into(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If type annotation is anyway needed, consider From instead of Into:
| let variant_struct_array: StructArray = array_builder.build().into(); | |
| let variant_struct_array = StructArray::from(array_builder.build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| _is_nullable: bool, | ||
| _struct_mode: StructMode, | ||
| ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError> { | ||
| if let Some(field) = field && field.try_extension_type::<VariantType>().is_ok() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if-let chaining would bump MSRV, no?
(if MSRV already bumped... yay!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile tho -- wouldn't it be nicer to have a boolean-returning compat checker on extension types, that doesn't require allocating the actual extension type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like its availabie in 2024 edition (which we have). Also modified to do an equality check on the extension name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was more about the extension type API, less about your code that's forced to use what we currently have. Sorry if that was unclear.
I don't think it's enough to just check the extension name. The extension may involve other metadata and also has opinions about which physical data types are valid. I suppose we could do a hybrid -- only try to instantiate the extension type if the name matches?
| } | ||
| object_builder.finish(); | ||
| } | ||
| TapeElement::EndObject(_u32) => { return Err(ArrowError::JsonError("unexpected end of object".to_string())) }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure fmt doesn't ever produce this pattern. Is CI not catching it somehow?
(I saw some long lines earlier as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right I had forgotten to run fmt.
|
CC @nicklan -- this could potentially allow more flexible parsing of JSON data that may not match the expected schema. Tho maybe it's still better to just parse those as variant. |
arrow-json/src/reader/mod.rs
Outdated
| pub trait DecoderFactory: std::fmt::Debug + Send + Sync { | ||
| /// Make a decoder that overrides the default decoder for a specific data type. | ||
| /// This can be used to override how e.g. error in decoding are handled. | ||
| fn make_default_decoder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to make_custom_decoder
| let mut builder = VariantBuilder::new(); | ||
| variant_from_tape_element(&mut builder, *p, tape)?; | ||
| let (metadata, value) = builder.finish(); | ||
| array_builder.append_value(Variant::new(&metadata, &value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| let (metadata, value) = builder.finish(); | ||
| array_builder.append_value(Variant::new(&metadata, &value)); | ||
| } | ||
| let variant_struct_array: StructArray = array_builder.build().into(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| _is_nullable: bool, | ||
| _struct_mode: StructMode, | ||
| ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError> { | ||
| if let Some(field) = field && field.try_extension_type::<VariantType>().is_ok() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like its availabie in 2024 edition (which we have). Also modified to do an equality check on the extension name.
| } | ||
| object_builder.finish(); | ||
| } | ||
| TapeElement::EndObject(_u32) => { return Err(ArrowError::JsonError("unexpected end of object".to_string())) }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right I had forgotten to run fmt.
| TapeElement::I64(i) => { | ||
| return Err(ArrowError::JsonError(format!( | ||
| "I64 tape element not supported: {i}" | ||
| ))); | ||
| } | ||
| TapeElement::I32(i) => { | ||
| return Err(ArrowError::JsonError(format!( | ||
| "I32 tape element not supported: {i}" | ||
| ))); | ||
| } | ||
| TapeElement::F64(f) => { | ||
| return Err(ArrowError::JsonError(format!( | ||
| "F64 tape element not supported: {f}" | ||
| ))); | ||
| } | ||
| TapeElement::F32(f) => { | ||
| return Err(ArrowError::JsonError(format!( | ||
| "F32 tape element not supported: {f}" | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases could easily be handled, but I could not get a JSON value to actually deserialize to these. Either I'm doing something wrong or they are unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are definitely unused. I noticed that as well when playing around earlier.
It's used for converting values to JSON, perhaps?
| } | ||
|
|
||
| fn parse_number<'a, 'b>(s: &'a str) -> Result<Variant<'a, 'b>, ArrowError> { | ||
| if let Ok(v) = lexical_core::parse(s.as_bytes()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why use lexical_core instead of the built-in number parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No preference really. I just saw it used in a few other places in arrow-json.
Which issue does this PR close?
Closes #8987
What changes are included in this PR?
This incorporates feedback from the discussion on #8998 and includes #7442 with some small adjustments.
With this change you can now register a custom decoder on arrow-json reader.
As a motivating example, this PR also provides the
VariantArrayDecoderFactorywhich allows for Variant annotated Fields to be deserialized into Variant columns.For example
This includes #7442 with only two minor changes:
make_default_decoderin theDecoderFactorytrait, which seemed redundantfieldas a parameter tomake_decoderwhich is necessary to extract the extension information, although it is somewhat redundant.@scovich I incorporated your review feedback from #8998, except where noted inline in the code.
Are these changes tested?
Unit test was added. Open to feedback on whether that test is adequate or more is expected.
Are there any user-facing changes?
Yes, users can now register custom decoders with arrow-json Reader, and use
VariantArrayDecoderFactoryto deserialize json into Variant columns.