
Conversation

@debugmiller

Which issue does this PR close?

Closes #8987

What changes are included in this PR?

This incorporates feedback from the discussion on #8998 and includes #7442 with some small adjustments.

With this change you can now register a custom decoder on the arrow-json reader.

As a motivating example, this PR also provides the VariantArrayDecoderFactory, which allows Variant-annotated Fields to be deserialized into Variant columns.

For example

let variant_array = VariantArrayBuilder::new(0).build();

let struct_field = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    // call VariantArray::field to get the correct Field
    variant_array.field("var"),
]);

let builder = ReaderBuilder::new(Arc::new(struct_field.clone()));
let result = builder
    .with_struct_mode(StructMode::ObjectOnly)
    .with_decoder_factory(Arc::new(VariantArrayDecoderFactory))
    .build(Cursor::new(b"{\"id\": 1, \"var\": [\"mixed data\", 1]}"));
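Assuming the build above succeeds, the reader behaves like any other arrow-json reader and yields RecordBatches. A minimal sketch of consuming it (not part of the PR, just for illustration):

let mut reader = result.expect("schema and factory are compatible");
let batch = reader.next().unwrap().expect("valid JSON");
assert_eq!(batch.num_rows(), 1); // the "var" column now holds the Variant data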

This includes #7442 with only two minor changes:

  • removed the default implementation of make_default_decoder in the DecoderFactory trait, which seemed redundant
  • added field as a parameter to make_decoder, which is necessary to extract the extension information, although it is somewhat redundant with the data_type and is_nullable parameters (see the sketch below)
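Putting those two adjustments together, the factory hook ends up looking roughly like the sketch below. This is assembled from the diff excerpts quoted later in this thread; the parameter order and the ArrayDecoder export path are assumptions, and the authoritative definition is the PR diff itself:

use arrow_json::reader::ArrayDecoder; // assumed export added by this PR
use arrow_json::StructMode;
use arrow_schema::{ArrowError, DataType, FieldRef};

pub trait DecoderFactory: std::fmt::Debug + Send + Sync {
    /// Return Some(decoder) to override the built-in decoder for this field,
    /// or None to fall back to the default behavior.
    fn make_custom_decoder(
        &self,
        field: Option<FieldRef>,
        data_type: DataType,
        is_nullable: bool,
        struct_mode: StructMode,
    ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError>;
}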

@scovich I incorporated your review feedback from #8998, except where noted inline in the code.

Are these changes tested?

A unit test was added. Open to feedback on whether that test is adequate or more coverage is expected.

Are there any user-facing changes?

Yes, users can now register custom decoders with the arrow-json Reader, and use VariantArrayDecoderFactory to deserialize JSON into Variant columns.

@github-actions bot added the arrow (Changes to the arrow crate) and parquet-variant (parquet-variant* crates) labels on Dec 19, 2025
}

fn make_decoder(
field: Option<FieldRef>,
Author

@scovich you commented here on the other PR about it being awkward to take Field and still require data_type and is_nullable. I agree, but the issue I ran into is that when ReaderBuilder::build_decoder calls make_decoder on the root of the schema, it creates a Struct based on the Schema to pass in.

Another option here could be to create a dummy Field from the Schema. I'd have to look at whether data_type and is_nullable are always inherited from the field.

Contributor

They're not always inherited. Some callers are passing the union of all ancestors' nullability, i.e. when extracting a nested field from a struct. That's why I had suggested perhaps just passing the field metadata here -- because I don't think the nullable and data type members of a passed-in Field would be used, potentially leading to confusion.

@debugmiller marked this pull request as ready for review on December 19, 2025 at 22:38
Contributor

@scovich left a comment

Very fun! Sparks lots of ideas, see comments.

Also -- looking at the top-level documentation for arrow-json, we might want to consider some custom decoder support for binary data, to round-trip it with encodings other than Base16? That would be way nicer than the currently recommended workaround of manually casting the columns?

Comment on lines +802 to +803
field.clone(),
data_type.clone(),
Contributor

Tiny nit: We don't know yet that the factory will actually produce a decoder. Should the trait take &Arc so implementations can decide whether to clone them? Or is the overhead so (relatively) low compared to everything else involved with parsing that we don't care?

pub trait DecoderFactory: std::fmt::Debug + Send + Sync {
/// Make a decoder that overrides the default decoder for a specific data type.
/// This can be used to override how e.g. errors in decoding are handled.
fn make_default_decoder(
Contributor

This is IMO a misleading method name:

  • The "default" encoder is whatever would have been created if no factory existed.
  • A factory is used to override that default.

So it seems like make_decoder or make_custom_decoder would be more descriptive?

Author

Renamed to make_custom_decoder.

/// assert_eq!(values.value(0), "a");
/// assert!(values.is_null(1));
/// ```
pub trait DecoderFactory: std::fmt::Debug + Send + Sync {
Contributor

If I am understanding correctly here, one could almost use this framework to set up a fully customizable decoding. But not quite.

The decoder factory could intercept the top-level struct of the input schema and assign a custom decoder to every field based on path and/or type. That could be really slick for custom error handling, for example turning all incompatible inputs to NULL instead of producing errors (potentially only for a subset of known-funky fields while leaving "reliable" fields fully strongly and strictly typed). Problem is, if the decoder factory intercepts a complex type, it has to provide a custom decoder for every field of that struct (and wrap them all in a custom complex type decoder). Ouch.

I don't know that we would want to make all the internal decoders public -- they are currently pub members of private modules -- but is there perhaps a way to give decoder factories access to some functionality like make_decoder, even if only implicitly? A way to say "do whatever you normally would have done here"?

For primitives, it might be as simple as defining a DefaultDecoderFactory that forwards blindly to make_decoder.

But struct/map/list would need something more, because the decoder factory for a complex type would need to assemble the custom sub-decoders of that type's children (if not defaulted). Maybe the hypothetical DefaultDecoderFactory would have associated make_xxx_decoder functions which are just validating wrappers that create the appropriate decoder directly from user-provided child decoders? e.g. make_list_decoder would take a single Box<dyn ArrayDecoder>, make_map_decoder would take two decoders, and make_struct_decoder would take a vec of field decoders. Alternatively, we could make the complex type decoders pub and define actual try_new constructors. But that would increase the number of public types in an asymmetric way (what about the primitive decoders?).
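For concreteness, a purely hypothetical sketch of those helper signatures (nothing like this exists in the PR; names, import paths, and shapes are illustrative only):

use arrow_json::reader::ArrayDecoder; // assumed export added by this PR
use arrow_schema::{ArrowError, FieldRef};

/// Hypothetical validating wrappers over the built-in complex-type decoders.
pub fn make_list_decoder(
    item: Box<dyn ArrayDecoder>,
) -> Result<Box<dyn ArrayDecoder>, ArrowError> {
    unimplemented!("would wrap `item` in the built-in list decoder")
}

pub fn make_map_decoder(
    keys: Box<dyn ArrayDecoder>,
    values: Box<dyn ArrayDecoder>,
) -> Result<Box<dyn ArrayDecoder>, ArrowError> {
    unimplemented!("would wrap the key/value decoders in the built-in map decoder")
}

pub fn make_struct_decoder(
    fields: Vec<(FieldRef, Box<dyn ArrayDecoder>)>,
) -> Result<Box<dyn ArrayDecoder>, ArrowError> {
    unimplemented!("would assemble per-field decoders into the built-in struct decoder")
}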

Author

I'm a little worried about this PR growing in scope. (My main goal was to make it so that Variant could be decoded, and I've already had to pull in the extension code too.) Nevertheless, I think it's worth considering more usefulness.

I'm hearing two related uses.

First, for changing the behavior of some fields in complex types: I think another way of handling this is to augment the schema in some way with a special decoder kind (rough sketch below). This could allow the default behavior to be used by the decoder (including on structs/lists) but specialized decoders to be substituted in directly on the (sub)fields indicated by the schema. I'm not sure whether this would fit into the existing schema types (e.g. special metadata) or whether a new kind of DecoderSchema would be created to hold this information.

Second, for catching errors on primitive types and substituting something else (e.g. NULL): my suggestion is to make that an optional setting on the built-in primitive decoders, possibly combined with the solution above so that certain fields could be marked with this behavior in the schema.
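For the first idea, a minimal sketch of what marking a sub-field might look like using ordinary Field metadata (the metadata key and value here are made up, not part of this PR):

use std::collections::HashMap;
use arrow_schema::{DataType, Field};

// Hypothetical: tag a sub-field so a custom DecoderFactory can recognize it
// by metadata instead of intercepting the whole parent struct.
let field = Field::new("maybe_bad", DataType::Int64, true).with_metadata(HashMap::from([(
    "arrow.json.decoder".to_string(), // made-up metadata key
    "null_on_error".to_string(),      // made-up decoder kind
)]));
assert_eq!(
    field.metadata().get("arrow.json.decoder").map(String::as_str),
    Some("null_on_error"),
);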

/// This decoder is typically used indirectly via [`VariantArrayDecoderFactory`] when
/// reading JSON data into Variant columns.
#[derive(Default)]
pub struct VariantArrayDecoder;
Contributor

I just realized! When combined with the variant shredding struct translation work done elsewhere -- and the full-custom decoding support I mentioned above -- we could define a ShreddedVariantArrayDecoder that turns JSON directly into shredded variant. That would be super slick when parsing mostly-strongly-typed schemas.

Comment on lines 43 to 46
let mut builder = VariantBuilder::new();
variant_from_tape_element(&mut builder, *p, tape)?;
let (metadata, value) = builder.finish();
array_builder.append_value(Variant::new(&metadata, &value));
Contributor

VariantArrayBuilder implements the VariantBuilderExt API. You should be able to just pass it to variant_from_tape_element. That avoids the intermediate vec allocations and validation this loop currently pays for.

Suggested change
- let mut builder = VariantBuilder::new();
- variant_from_tape_element(&mut builder, *p, tape)?;
- let (metadata, value) = builder.finish();
- array_builder.append_value(Variant::new(&metadata, &value));
+ variant_from_tape_element(&mut array_builder, *p, tape)?;

Author

Done

let (metadata, value) = builder.finish();
array_builder.append_value(Variant::new(&metadata, &value));
}
let variant_struct_array: StructArray = array_builder.build().into();
Contributor

nit: If a type annotation is needed anyway, consider From instead of Into:

Suggested change
- let variant_struct_array: StructArray = array_builder.build().into();
+ let variant_struct_array = StructArray::from(array_builder.build());

Author

done

_is_nullable: bool,
_struct_mode: StructMode,
) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError> {
if let Some(field) = field && field.try_extension_type::<VariantType>().is_ok() {
Contributor

if-let chaining would bump MSRV, no?
(if MSRV already bumped... yay!)

Contributor

Meanwhile though -- wouldn't it be nicer to have a boolean-returning compat checker on extension types that doesn't require allocating the actual extension type?

Author

Looks like it's available in the 2024 edition (which we have). Also modified to do an equality check on the extension name.

Contributor

My comment was more about the extension type API, less about your code that's forced to use what we currently have. Sorry if that was unclear.

I don't think it's enough to just check the extension name. The extension may involve other metadata and also has opinions about which physical data types are valid. I suppose we could do a hybrid -- only try to instantiate the extension type if the name matches?
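A sketch of that hybrid check (Field::extension_type_name and ExtensionType::NAME exist in arrow-schema today; the VariantType import path is an assumption):

use arrow_schema::extension::ExtensionType;
use arrow_schema::Field;
use parquet_variant_compute::VariantType; // assumed path

// Only pay for full extension-type construction/validation when the name matches.
fn is_variant_field(field: &Field) -> bool {
    field.extension_type_name() == Some(VariantType::NAME)
        && field.try_extension_type::<VariantType>().is_ok()
}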

}
object_builder.finish();
}
TapeElement::EndObject(_u32) => { return Err(ArrowError::JsonError("unexpected end of object".to_string())) },
Contributor

I'm pretty sure fmt doesn't ever produce this pattern. Is CI not catching it somehow?
(I saw some long lines earlier as well)

Author

You're right, I had forgotten to run fmt.

@scovich (Contributor) commented Dec 21, 2025

CC @nicklan -- this could potentially allow more flexible parsing of JSON data that may not match the expected schema. Though maybe it's still better to just parse those as variant.

Comment on lines +135 to +154
TapeElement::I64(i) => {
return Err(ArrowError::JsonError(format!(
"I64 tape element not supported: {i}"
)));
}
TapeElement::I32(i) => {
return Err(ArrowError::JsonError(format!(
"I32 tape element not supported: {i}"
)));
}
TapeElement::F64(f) => {
return Err(ArrowError::JsonError(format!(
"F64 tape element not supported: {f}"
)));
}
TapeElement::F32(f) => {
return Err(ArrowError::JsonError(format!(
"F32 tape element not supported: {f}"
)));
}
Author

These cases could easily be handled, but I could not get a JSON value to actually deserialize to these. Either I'm doing something wrong or they are unused.

Contributor

They are definitely unused; I noticed that as well when playing around earlier.
Perhaps they're used when converting values to JSON?

}

fn parse_number<'a, 'b>(s: &'a str) -> Result<Variant<'a, 'b>, ArrowError> {
if let Ok(v) = lexical_core::parse(s.as_bytes()) {
Contributor

Out of curiosity, why use lexical_core instead of the built-in number parsing?

Author

No preference really. I just saw it used in a few other places in arrow-json.
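For comparison, a sketch of the same fallback chain using std parsing instead of lexical_core (assuming the intent is: prefer i64, fall back to f64; the Variant import path is an assumption):

use arrow_schema::ArrowError;
use parquet_variant::Variant; // assumed path

fn parse_number(s: &str) -> Result<Variant<'_, '_>, ArrowError> {
    if let Ok(v) = s.parse::<i64>() {
        Ok(Variant::from(v))
    } else if let Ok(v) = s.parse::<f64>() {
        Ok(Variant::from(v))
    } else {
        Err(ArrowError::JsonError(format!("failed to parse {s} as a number")))
    }
}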
