
I have a simple pydantic model with nested data structures. I want to be able to simply save and load instances of this model as a .json file.

All models inherit from a Base class with simple configuration.

class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'   # forbid use of extra kwargs

There are some simple data models with inheritance

class Thing(Base):
    thing_id: int

class SubThing(Thing):
    name: str

And a Container class, which holds a Thing

class Container(Base):
    thing: Thing

I can create a Container instance and save it as .json

# make instance of container
c = Container(
    thing = SubThing(
        thing_id=1,
        name='my_thing')
)

json_string = c.json(indent=2)
print(json_string)

"""
{
  "thing": {
    "thing_id": 1,
    "name": "my_thing"
  }
}
"""

but the json string does not specify that the thing field was constructed using a SubThing. As such, when I try to load this string into a new Container instance, I get an error.

c = Container.parse_raw(json_string)
"""
Traceback (most recent call last):
  File "...", line 36, in <module>
    c = Container.parse_raw(json_string)
  File "pydantic/main.py", line 601, in pydantic.main.BaseModel.parse_raw
  File "pydantic/main.py", line 578, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Container
thing -> name
  extra fields not permitted (type=value_error.extra)
"""
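For context, the error comes from the `extra = 'forbid'` setting, and it is arguably doing its job: without it, the round trip would "succeed" but silently degrade the `SubThing` to a plain `Thing`. A minimal sketch (assuming pydantic v1-style `json`/`parse_raw` APIs):

```python
import pydantic


# Same models as above, but without extra = 'forbid', to show what the
# stricter config is protecting against.
class Thing(pydantic.BaseModel):
    thing_id: int


class SubThing(Thing):
    name: str


class Container(pydantic.BaseModel):
    thing: Thing


c1 = Container(thing=SubThing(thing_id=1, name='my_thing'))
c2 = Container.parse_raw(c1.json())

# The round trip no longer raises, but the payload is validated against
# the annotation (Thing), so the SubThing-specific 'name' field does not
# survive the round trip.
print(type(c2.thing).__name__)   # Thing
```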

Is there a simple way to save the Container instance while retaining information about the thing class type such that I can reconstruct the initial Container instance reliably? I would like to avoid pickling the object if possible.

One possible solution is to serialize manually, for example using


def serialize(attr_name, attr_value, dictionary=None):
    """Recursively convert a model to a dict, tagging nested models with their class name."""
    if dictionary is None:
        dictionary = {}
    if not isinstance(attr_value, pydantic.BaseModel):
        dictionary[attr_name] = attr_value
    else:
        sub_dictionary = {}
        for (sub_name, sub_value) in attr_value:
            serialize(sub_name, sub_value, dictionary=sub_dictionary)
        dictionary[attr_name] = {type(attr_value).__name__: sub_dictionary}
    return dictionary


c1 = Container(
    thing=SubThing(
        thing_id=1,
        name='my_thing')
)

from pprint import pprint as print
print(serialize('Container', c1))

{'Container': {'Container': {'thing': {'SubThing': {'name': 'my_thing',
                                                    'thing_id': 1}}}}}

but this gets rid of most of the benefits of leveraging the package for serialization.
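If you do go down this road, you also need the inverse. One possible counterpart, assuming a hand-maintained registry mapping class names back to model classes (`MODEL_REGISTRY` and `deserialize` are illustrative names, not part of pydantic):

```python
import pydantic


class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'


class Thing(Base):
    thing_id: int


class SubThing(Thing):
    name: str


class Container(Base):
    thing: Thing


# Hypothetical registry mapping class names back to model classes.
MODEL_REGISTRY = {cls.__name__: cls for cls in (Container, Thing, SubThing)}


def deserialize(tagged):
    # Expects the {ClassName: {field: value, ...}} shape produced by
    # the serialize() helper above.
    (class_name, fields), = tagged.items()
    kwargs = {}
    for name, value in fields.items():
        # A single-key dict whose key is a registered class name is a
        # nested, type-tagged model; recurse into it.
        if (isinstance(value, dict) and len(value) == 1
                and next(iter(value)) in MODEL_REGISTRY):
            kwargs[name] = deserialize(value)
        else:
            kwargs[name] = value
    return MODEL_REGISTRY[class_name](**kwargs)


# A dict in the shape serialize() produces (minus the outer attr_name key).
tagged = {'Container': {'thing': {'SubThing': {'thing_id': 1,
                                               'name': 'my_thing'}}}}
restored = deserialize(tagged)
print(type(restored.thing).__name__)   # SubThing
```

The registry keeps the mapping explicit; a more magical alternative would be to look classes up via `Thing.__subclasses__()`, at the cost of requiring every subclass to be imported before loading.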

twhughes
    why are you using `pydantic` in any case - like do you benefit from the validations it provides? just curious – rv.kvetch Sep 19 '21 at 22:04
  • 1
    yes, I use it mainly for the validations, but in principle I could use something else. This is an extremely simplified version of my actual application. – twhughes Sep 19 '21 at 22:14
  • 1
    doing only a cursory look on the web, it looks like this is a known problem that `pydantic` doesn't support loading nested json to a model class, yet there are plans for future support in this use case. I was actually surprised that pydantic doesn't parse a dict to a nested model - seems like a common enough use case to me. – rv.kvetch Sep 19 '21 at 22:15
  • 1
    Using a root validator as mentioned [here](https://github.com/samuelcolvin/pydantic/issues/1189#issuecomment-578084930) might also work – rv.kvetch Sep 19 '21 at 22:16
  • Hm, thanks for the help. Do you know if this issue exists for other packages, such as dataclasses? What would you recommend to handle serialization of nested dataclass-like objects like this? Note that `dict(c)` seems to retain some field information, so one brute force option would be to write my own serializer, but I'd prefer leveraging the package – twhughes Sep 19 '21 at 22:21
  • 1
    I've tested serialization with dataclasses and that works perfectly for the most part. I did notice an issue with some field types, namely `defaultdict` fields for example. It looks like dataclasses doesn't handle serialization of such field types as expected (I guess it treats it as a normal dict). You can use the `dataclasses.asdict()` helper function to serialize a dataclass instance, which also works for nested dataclasses. The only problem is de-serializing it back from a dict, which unfortunately seems to be a missing link in dataclasses. – rv.kvetch Sep 20 '21 at 00:12
  • 1
    If you're interested in using dataclasses, you can take a look at [this answer](https://stackoverflow.com/questions/69128123/nested-python-dataclasses-with-list-annotations/69133191#69133191) that I added a while back, as it might be useful. With such a library you can simply use dataclasses and it will provide (de)serialization with minimal changes. It also supports loading a nested dataclass structure from any plain dict. – rv.kvetch Sep 20 '21 at 00:15
  • 1
    Added also a separate answer below (and an approach that I was able to get working with `pydantic` for this use case) – rv.kvetch Sep 20 '21 at 02:08

1 Answer


Try this solution, which I was able to get working with pydantic. It's a bit ugly and somewhat hackish, but at least it works as expected.

import pydantic


class Base(pydantic.BaseModel):
    class Config:
        extra = 'forbid'   # forbid use of extra kwargs


class Thing(Base):
    thing_id: int


class SubThing(Thing):
    name: str


class Container(Base):
    thing: Thing

    def __init__(self, **kwargs):
        # This answer helped steer me towards this solution:
        #   https://stackoverflow.com/a/66582140/10237506
        # If `thing` arrives as a plain dict (e.g. from `Container(**d)`),
        # rebuild it as a SubThing before pydantic validates it as a Thing.
        if isinstance(kwargs.get('thing'), dict):
            kwargs['thing'] = SubThing(**kwargs['thing'])
        super().__init__(**kwargs)


def main():
    # make instance of container
    c1 = Container(
        thing=SubThing(
            thing_id=1,
            name='my_thing')
    )

    d = c1.dict()
    print(d)
    # {'thing': {'thing_id': 1, 'name': 'my_thing'}}

    # Now it works!
    c2 = Container(**d)

    print(c2)
    # thing=SubThing(thing_id=1, name='my_thing')
    
    # assert that the values for the de-serialized instance is the same
    assert c1 == c2


if __name__ == '__main__':
    main()

If you don't need some of the features that pydantic provides, such as data validation, you can use plain dataclasses easily enough. You can pair them with a (de)serialization library like dataclass-wizard, which provides automatic case transforms and type conversion (for example, string to annotated int), much as pydantic does. Here is a straightforward usage below:

from dataclasses import dataclass

from dataclass_wizard import asdict, fromdict


@dataclass
class Thing:
    thing_id: int


@dataclass
class SubThing(Thing):
    name: str


@dataclass
class Container:
    # Note: I had to update the annotation to `SubThing`; otherwise,
    # de-serializing creates a `Thing` instance, which is not
    # what we want.
    thing: SubThing


def main():
    # make instance of container
    c1 = Container(
        thing=SubThing(
            thing_id=1,
            name='my_thing')
    )

    d = asdict(c1)
    print(d)
    # {'thing': {'thingId': 1, 'name': 'my_thing'}}

    # De-serialize a dict object in a new `Container` instance
    c2 = fromdict(Container, d)

    print(c2)
    # Container(thing=SubThing(thing_id=1, name='my_thing'))

    # assert that the values for the de-serialized instance is the same
    assert c1 == c2


if __name__ == '__main__':
    main()
rv.kvetch
  • Thanks for the suggestion, I think the 2nd one is not quite right because Container.thing now accepts SubThing, not Thing. – twhughes Sep 20 '21 at 03:28
  • 1
    Yep, you are right. I had to change the annotation to `thing: SubThing` as otherwise it'll try to load a dict into a `Thing` type. I'll update the answer to clarify. – rv.kvetch Sep 20 '21 at 03:47
  • 1
    actually, I see the problem now. I guess I didn't read the question above too carefully. – rv.kvetch Sep 20 '21 at 03:56
  • 1
    No, but there are several other subclasses of `Thing`, so it should be able to handle those. – twhughes Sep 20 '21 at 03:56
  • 1
    I wrote a quick recursive function to serialize it by hand. I will edit my original question with the code. – twhughes Sep 20 '21 at 03:57
  • No worries, it's kind of an obscure question and I appreciate the help, I'm also surprised this doesn't seem to be supported natively in Pydantic or dataclasses. – twhughes Sep 20 '21 at 04:00
  • 1
    Yep, no problem and I get what you were asking now. I guess if you wanted to annotate a field as a more generic class such as `Thing` but populate it with a sub-class such as `SubThing` later, that would only make it a bit difficult to de-serialize back into a Container (since it would be looking to create a `Thing`, based on the annotation). A custom serializer with pydantic should hopefully work out for this use case. – rv.kvetch Sep 20 '21 at 04:06
  • 1
    I came up with a temporary solution to handle subclasses of `Thing` only, just declare `thing` as type `Union[SubThing1, SubThing2, ...]`. This covers my use case, with the exception that if the subclasses have the same kwarg signature, Pydantic will simply choose to serialize `thing` as `SubThing1`, no matter how it was initialized. It would be nice if Pydantic supported making the serialization type-aware, to handle subclasses like this. But for now I think this solution will do. – twhughes Sep 20 '21 at 15:08