Serialization is a bitch. You can't save pointers. You definitely can't save function pointers. Strings need length information if you don't want
to scan them twice when loading, but using 8 bytes for the length of a short string might double its storage cost. Lists have a similar problem.
Some struct members need to be saved and some are merely temporary cache. Hash tables need to be iterated to figure out how much space they need
because of variable key length. Lots of data structures need to be walked to determine their storage cost for that matter. Internal free-lists need
to be reconstructed after deserialization, and implicit element occupancy needs to be explicitly saved somehow.
And then there's versioning. How do you add new data to your serialization format? How do you keep this from becoming a maintenance nightmare?
How do you keep file sizes reasonable? How do you minimize boilerplate code?
The correct answer is that there's no correct answer. There are bad solutions for sure, but no good ones. Every situation has its own requirements
and has to make its own compromises.
Wrong Answers
"Just shove everything into [XML/JSON/whatever]." This is never a good answer for serializing program state. Everything has to be converted,
awkwardly and slowly, to some string representation. Then it all has to be parsed and reconstructed on the other side. Also nobody should
ever use XML for anything ever. Generalized text formats have tremendous overhead and cannot handle arbitrary binary data in a reasonable way.
You can't seek to a record in a text file without parsing everything before it; multithreaded loading becomes bottlenecked on text parsing.
"Just use A Real Database(TM)" Now you depend on some bloated SQL database to load your files. They're not known for maintaining stable
binary formats, which means you're now forever chasing around their versioning madness. And God forbid your chosen DB project change their license
like Mongo did. Also real databases reserve a lot of extra disk space. They optimize for random access speed, not disk space. All of your data needs to
be coerced into whatever structure the DB supports, creating an extra layer of awkward conversions and potentially limiting the shape of your internal
data structures because the DB doesn't support them at a practical level.
"Just use SQLite" All the downsides of text combined with all the downsides of a real SQL database.
"Just use your language's builtin Class Pickler" Ok, so how do you change your class and still load old data? How does an old version of
the program load a newer file? (Crazy feature, I know...) Can you tell it, in a reasonable manner, to only save certain members, or are you stuck
serializing all the temporary junk too? Is the format part of the spec, or are you at the mercy of compiler vendors and versions? How does it handle
pointers to external data? Does it recurse into them and store tons of duplicate copies
of shared data? Will it get stuck in an infinite loop if you have cyclic references? These language features are almost always impractical for non-trivial
applications.
Options That Aren't the Worst
Write a function to serialize each struct and append it to a buffer. Save the format version as the first field. Any data format
can be handled with whatever logic or extra metadata is necessary. Deserialization involves first reading the version,
then choosing the right function to call. Changes to data structures can be accommodated in each respective version function, including whatever
special code is needed to patch or shim the data. Save files from the future can be recognized trivially and rejected with a useful error message.
Nested data merely involves calling one serialization function from another since they work by appending data. Each serialization function
returns the size in bytes of the data it appended.
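A minimal sketch of this scheme in C might look like the following. The Buf, Vec2, and Player types, the helper names, and the premise that level was added in version 2 are all made up for illustration. Bounds checks on the read side are left out to keep it short, which you would not do with real files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; none of these names come from a real codebase. */
typedef struct { uint8_t *data; size_t len, cap; } Buf;
typedef struct { float x, y; } Vec2;
typedef struct {
    uint32_t id;
    Vec2     pos;
    uint32_t level;   /* assumed to have been added in format version 2 */
} Player;

enum { PLAYER_VERSION = 2 };

/* Append a little-endian u32. Returns the number of bytes appended. */
static size_t put_u32(Buf *b, uint32_t v) {
    assert(b->cap - b->len >= 4);   /* a real version would grow the buffer or fail */
    for (int i = 0; i < 4; i++) b->data[b->len++] = (uint8_t)(v >> (8 * i));
    return 4;
}

static size_t put_f32(Buf *b, float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return put_u32(b, bits);
}

/* Readers advance a cursor. Bounds checks are omitted to keep the sketch short. */
static uint32_t get_u32(const uint8_t **p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)(*p)[i] << (8 * i);
    *p += 4;
    return v;
}

static float get_f32(const uint8_t **p) {
    uint32_t bits = get_u32(p);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Nested data: the Player serializer just calls this one. */
static size_t vec2_serialize(Buf *b, const Vec2 *v) {
    size_t n = 0;
    n += put_f32(b, v->x);
    n += put_f32(b, v->y);
    return n;
}

static void vec2_deserialize(const uint8_t **p, Vec2 *v) {
    v->x = get_f32(p);
    v->y = get_f32(p);
}

/* Always writes the current version, version number first. */
static size_t player_serialize(Buf *b, const Player *pl) {
    size_t n = 0;
    n += put_u32(b, PLAYER_VERSION);
    n += put_u32(b, pl->id);
    n += vec2_serialize(b, &pl->pos);
    n += put_u32(b, pl->level);
    return n;
}

static void player_read_v1(const uint8_t **p, Player *pl) {
    pl->id = get_u32(p);
    vec2_deserialize(p, &pl->pos);
    pl->level = 1;                  /* shim: v1 files had no level */
}

static void player_read_v2(const uint8_t **p, Player *pl) {
    pl->id = get_u32(p);
    vec2_deserialize(p, &pl->pos);
    pl->level = get_u32(p);
}

/* Returns 0 on success, -1 for an unknown version (e.g. a file from the future). */
static int player_deserialize(const uint8_t **p, Player *pl) {
    switch (get_u32(p)) {
    case 1:  player_read_v1(p, pl); return 0;
    case 2:  player_read_v2(p, pl); return 0;
    default: return -1;
    }
}
```

Note that the size accumulation uses separate statements rather than one big sum. C doesn't specify the evaluation order of the operands of +, so writing put_f32(...) + put_f32(...) could append the fields in either order.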
The downside is that you have to keep around a function for every data version that you've ever shipped. On the upside, you don't have to keep
the old struct definitions around as well. The system is straightforward with no magic. It's relatively easy to debug, in my experience, compared to
some other options; you can put sentinels inside the data stream to tell where it gets out of sync.
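The sentinel trick can be sketched like this. The SENTINEL value, the Reader type, and the two-field record are hypothetical, and the sketch uses host byte order for brevity. The nfields parameter exists only so the example can simulate a reader that disagrees with the writer about the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel value; any bit pattern unlikely to occur in real data works. */
#define SENTINEL 0xDEADBEEFu

typedef struct { const uint8_t *data; size_t size, pos; } Reader;

/* Write a u32 in host byte order. Returns the number of bytes written. */
static size_t put_u32(uint8_t *out, uint32_t v) {
    memcpy(out, &v, sizeof v);
    return sizeof v;
}

/* Returns 0 on success, -1 if fewer than 4 bytes remain. */
static int get_u32(Reader *r, uint32_t *v) {
    assert(r->pos <= r->size);
    if (r->size - r->pos < sizeof *v) return -1;
    memcpy(v, r->data + r->pos, sizeof *v);
    r->pos += sizeof *v;
    return 0;
}

/* Returns 0 if the sentinel is exactly where the reader expects it, -1 otherwise. */
static int expect_sentinel(Reader *r) {
    uint32_t v;
    if (get_u32(r, &v) != 0 || v != SENTINEL) return -1;
    return 0;
}

/* Writer: two u32 fields followed by a sentinel. */
static size_t record_write(uint8_t *out, uint32_t a, uint32_t b) {
    size_t n = 0;
    n += put_u32(out + n, a);
    n += put_u32(out + n, b);
    n += put_u32(out + n, SENTINEL);
    return n;
}

/* Reader: reads nfields values, then insists on finding the sentinel. */
static int record_read(Reader *r, int nfields, uint32_t *fields) {
    for (int i = 0; i < nfields; i++)
        if (get_u32(r, &fields[i]) != 0) return -1;
    return expect_sentinel(r);
}
```

A reader that consumes one field too few or too many fails at the very next sentinel instead of quietly producing garbage several structs later, which narrows down where the stream went out of sync.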
In my versions of this scheme, calling the Serialize function with a null buffer makes it just return the size required. Internally, the first half
of the function computes that size with code that is separate from, but similar to, the actual data handling part. Not only can you allocate a sufficiently
sized buffer after an initial NULL call, but you can also verify that the buffer advanced by the same amount predicted by the first half of the function.
A lot of small errors are caught by this.
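Here is a rough sketch of the two-pass idea, assuming a hypothetical Item struct with a length-prefixed name. The function names are invented for the example, and the data is written in host byte order to keep it short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record; the names here are illustrative only. */
typedef struct {
    uint32_t    id;
    const char *name;   /* NUL-terminated in memory, length-prefixed when saved */
} Item;

/* With out == NULL, returns the size needed.
 * Otherwise writes into out and returns the number of bytes written. */
static size_t item_serialize(const Item *it, uint8_t *out) {
    uint32_t name_len = (uint32_t)strlen(it->name);

    /* First half: size calculation only. */
    size_t need = sizeof it->id       /* id */
                + sizeof name_len     /* name length */
                + name_len;           /* name bytes, no terminator */
    if (out == NULL) return need;

    /* Second half: the actual writes. */
    uint8_t *p = out;
    memcpy(p, &it->id, sizeof it->id);       p += sizeof it->id;
    memcpy(p, &name_len, sizeof name_len);   p += sizeof name_len;
    memcpy(p, it->name, name_len);           p += name_len;

    /* Cross-check: the cursor must have advanced by exactly the predicted amount. */
    assert((size_t)(p - out) == need);
    return (size_t)(p - out);
}

/* Size query, exact allocation, then the real call. The caller frees the result. */
static uint8_t *item_to_bytes(const Item *it, size_t *size_out) {
    size_t need = item_serialize(it, NULL);
    uint8_t *buf = malloc(need);
    if (buf == NULL) return NULL;
    *size_out = item_serialize(it, buf);
    assert(*size_out == need);
    return buf;
}
```

If someone adds a field to the second half and forgets the first, the assert fires the first time the struct is saved.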
It's not so tedious with the right set of helper functions.
Another option is to tag every field with some fixed ID unique within that struct. Files created by other versions of the program can be read
by just ignoring unknown fields. All fields can be pre-initialized with sane defaults, if such a scheme works for your data, for the case of missing
fields in the serialized data. Alternatively, your deserialization function can analyze which fields are and aren't present and make logical
decisions about how to handle the circumstances.
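As a sketch of the tagged approach, assume a hypothetical Settings struct and a made-up wire layout of one tag byte, one length byte, then the payload. The length byte is an assumption of this example: a reader needs some way to know how far to skip when it hits a tag it doesn't recognize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, tags, and defaults; all invented for illustration. */
typedef struct { uint32_t width, height, depth; } Settings;

/* Tag values are assigned by hand and never reused. */
enum {
    TAG_WIDTH  = 1,
    TAG_HEIGHT = 2,
    /* 3 is retired; do not reuse it */
    TAG_DEPTH  = 4
};

/* Assumed layout per field: [tag:1][len:1][payload:len]. Returns bytes written. */
static size_t put_field_u32(uint8_t *out, uint8_t tag, uint32_t v) {
    out[0] = tag;
    out[1] = 4;
    for (int i = 0; i < 4; i++) out[2 + i] = (uint8_t)(v >> (8 * i));
    return 6;
}

static uint32_t read_u32le(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

static size_t settings_serialize(const Settings *s, uint8_t *out) {
    size_t n = 0;
    n += put_field_u32(out + n, TAG_WIDTH,  s->width);
    n += put_field_u32(out + n, TAG_HEIGHT, s->height);
    n += put_field_u32(out + n, TAG_DEPTH,  s->depth);
    return n;
}

/* Returns 0 on success, -1 on truncated data. */
static int settings_deserialize(const uint8_t *data, size_t size, Settings *s) {
    assert(data != NULL || size == 0);

    /* Pre-initialize with defaults so missing fields are harmless. */
    s->width = 640; s->height = 480; s->depth = 32;

    size_t i = 0;
    while (i < size) {
        if (size - i < 2) return -1;             /* no room for tag + length */
        uint8_t tag = data[i], len = data[i + 1];
        i += 2;
        if (size - i < len) return -1;           /* payload cut short */
        const uint8_t *payload = data + i;
        switch (tag) {
        case TAG_WIDTH:  if (len == 4) s->width  = read_u32le(payload); break;
        case TAG_HEIGHT: if (len == 4) s->height = read_u32le(payload); break;
        case TAG_DEPTH:  if (len == 4) s->depth  = read_u32le(payload); break;
        default: break;                          /* unknown tag: skip it */
        }
        i += len;
    }
    return 0;
}
```

With this layout each 4-byte value costs 6 bytes in the file, which gives a feel for the overhead.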
Downsides are that the tags, even if just one byte each, consume quite a bit of space in practice. Perhaps this doesn't matter much if
you're running the results through a compression algorithm afterward. You also have to keep track of the unique tag values somehow and make
sure that retired ones aren't reused. You can't just use the byte offset or an auto-numbered enum for it, since both shift when fields are added or
removed; the values have to be manually assigned and maintained.