
Sometimes there is no silver bullet...

Serialization is a bitch. You can't save pointers. You definitely can't save function pointers. Strings need length information if you don't want to scan them twice when loading, but using 8 bytes for the length of a short string might double its storage cost. Lists have a similar problem. Some struct members need to be saved and some are merely temporary cache. Hash tables need to be iterated to figure out how much space they need because of variable key length. Lots of data structures need to be walked to determine their storage cost for that matter. Internal free-lists need to be reconstructed after deserialization, and implicit element occupancy needs to be explicitly saved somehow.
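
To make the string-length tradeoff concrete, here is a rough sketch in C of one common compromise: a variable-length integer prefix (LEB128-style, 7 bits per byte), so short strings pay one byte for their length instead of eight. This is a swapped-in technique for illustration, not something described above, and the helper names (put_varint, get_varint, put_string) are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write v as a LEB128-style varint. Every byte carries
   7 bits of the value; the high bit is set on all bytes except the last.
   out must have room for up to 10 bytes. Returns bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Read a varint. Returns bytes consumed, or 0 if the input is truncated
   or runs longer than 10 bytes. */
static size_t get_varint(const uint8_t *in, size_t avail, uint64_t *v) {
    uint64_t result = 0;
    assert(in != NULL && v != NULL);
    for (size_t n = 0; n < avail && n < 10; n++) {
        result |= (uint64_t)(in[n] & 0x7F) << (7 * n);
        if (!(in[n] & 0x80)) {
            *v = result;
            return n + 1;
        }
    }
    return 0;
}

/* Length-prefixed string: varint length, then the bytes, no terminator.
   Returns bytes written. */
static size_t put_string(uint8_t *out, const char *s) {
    size_t len, n;
    assert(out != NULL && s != NULL);
    len = strlen(s);
    n = put_varint(out, len);
    memcpy(out + n, s, len);
    return n + len;
}
```

With this, "hello" costs 6 bytes on disk instead of 13. The price is that the length field is no longer a fixed size, so you can't patch it in place later without knowing how long the string will be.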

And then there's versioning. How do you add new data to your serialization format? How do you keep this from becoming a maintenance nightmare?

How do you keep file sizes reasonable? How do you minimize boilerplate code?


The correct answer is that there's no correct answer. There are bad solutions for sure, but no good ones. Every situation has its own requirements and has to make its own compromises.

Wrong Answers

"Just shove everything into [XML/JSON/whatever]." This is never a good answer for serializing program state. Everything has to be converted, awkwardly and slowly, to some string representation. Then it all has to be parsed and reconstructed on the other side. Also nobody should ever use XML for anything ever. Generalized text formats have tremendous overhead and cannot handle arbitrary binary data in a reasonable way. Text files cannot be seeked; multithreaded loading becomes bottlenecked on text parsing.

"Just use A Real Database(TM)" Now you depend on some bloated SQL database to load your files. They're not known for maintaining stable binary formats, which means you're now forever chasing around their versioning madness. And God forbid your chosen DB project change their license like Mongo did. Also real databases reserve a lot of extra disk space. They optimize for random access speed, not disk space. All of your data needs to be coerced into whatever structure the DB supports, creating an extra layer of awkward conversions and potentially limiting the shape of your internal data structures because the DB doesn't support them at a practical level.

"Just use SQLite" All the downsides of text combined with all the downsides of a real SQL database.

"Just use your language's builtin Class Pickler" Ok, so how do you change your class and still load old data? How does an old version of the program load a newer file? (Crazy feature, I know...) Can you tell it, in a reasonable manner, to only save certain members, or are you stuck serializing all the temporary junk too? Is the format part of the spec, or are you at the mercy of compiler vendors and versions? How does it handle pointers to external data? Does it recurse into them and store tons of duplicate copies of shared data? Will it get stuck in an infinite loop if you have cyclic references? These language features are almost always impractical for non-trivial applications.

Options That Aren't the Worst

Write a function to serialize each struct and append it to a buffer. Save the format version as the first field. Any data format can be handled with whatever logic or extra metadata is necessary. Deserialization involves first reading the version then choosing the right function to call. Changes to data structures can be accommodated in each respective version function including whatever special code is needed to patch or shim the data. Save files from the future can be recognized trivially and rejected with a useful error message. Nested data merely involves calling one serialization function from another since they work by appending data. Each serialization function returns the size in bytes of the data it appended.
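
A minimal sketch of what that can look like in C. Everything named here (Player, PLAYER_VERSION, put_u32, the v1/v2 readers) is hypothetical and only for illustration; the fixed little-endian byte order is also just an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLAYER_VERSION 2u   /* hypothetical current format version */

/* Hypothetical struct. cached_score is temporary cache and is never saved. */
typedef struct {
    uint32_t health;
    uint32_t armor;          /* added in version 2 */
    uint32_t cached_score;
} Player;

/* Append a little-endian u32. Returns bytes appended. */
static size_t put_u32(uint8_t *buf, uint32_t v) {
    for (int i = 0; i < 4; i++) buf[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

static uint32_t get_u32(const uint8_t *buf) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)buf[i] << (8 * i);
    return v;
}

/* Always writes the current version, version field first.
   Returns bytes appended, so a parent struct can call this and keep going. */
static size_t player_serialize(uint8_t *buf, const Player *p) {
    size_t n = 0;
    assert(buf != NULL && p != NULL);
    n += put_u32(buf + n, PLAYER_VERSION);
    n += put_u32(buf + n, p->health);
    n += put_u32(buf + n, p->armor);
    return n;
}

/* One reader per shipped version. Each returns bytes consumed. */
static size_t player_read_v1(const uint8_t *buf, Player *p) {
    p->health = get_u32(buf);
    p->armor = 0;            /* shim: the field did not exist in v1 */
    return 4;
}

static size_t player_read_v2(const uint8_t *buf, Player *p) {
    p->health = get_u32(buf);
    p->armor = get_u32(buf + 4);
    return 8;
}

/* Read the version, then dispatch. Returns bytes consumed, or 0 if the
   data is truncated or the version is unknown (e.g. a file from the future). */
static size_t player_deserialize(const uint8_t *buf, size_t avail, Player *p) {
    uint32_t version;
    size_t body;
    assert(buf != NULL && p != NULL);
    if (avail < 4) return 0;
    version = get_u32(buf);
    p->cached_score = 0;     /* caches get rebuilt, not loaded */
    switch (version) {
    case 1:
        if (avail < 8) return 0;
        body = player_read_v1(buf + 4, p);
        break;
    case 2:
        if (avail < 12) return 0;
        body = player_read_v2(buf + 4, p);
        break;
    default:
        return 0;
    }
    return 4 + body;
}
```

A real version would report why it failed instead of returning a bare 0, which is where the useful error message for future files comes from.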

The downside is that you have to keep around a function for every data version that you've ever shipped. Upsides include not having to keep around struct definitions too. The system is straightforward with no magic. It's relatively easy to debug, in my experience, compared to some other options; you can put sentinels inside the data stream to tell where it gets out of sync.
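
The sentinel trick might look something like this. The SENTINEL value and the record layout are arbitrary choices for the example, not a fixed convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0xDEADBEEFu   /* arbitrary marker value, chosen for the example */

static size_t put_u32(uint8_t *buf, uint32_t v) {
    for (int i = 0; i < 4; i++) buf[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

static uint32_t get_u32(const uint8_t *buf) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)buf[i] << (8 * i);
    return v;
}

/* Hypothetical record: two u32 fields followed by a sentinel.
   Returns bytes appended. */
static size_t record_write(uint8_t *buf, uint32_t a, uint32_t b) {
    size_t n = 0;
    assert(buf != NULL);
    n += put_u32(buf + n, a);
    n += put_u32(buf + n, b);
    n += put_u32(buf + n, SENTINEL);
    return n;
}

/* Returns bytes consumed, or 0 if the data is truncated or the sentinel
   is not where the reader expects it (reader and writer disagree). */
static size_t record_read(const uint8_t *buf, size_t avail,
                          uint32_t *a, uint32_t *b) {
    assert(buf != NULL && a != NULL && b != NULL);
    if (avail < 12) return 0;
    *a = get_u32(buf);
    *b = get_u32(buf + 4);
    if (get_u32(buf + 8) != SENTINEL) return 0;
    return 12;
}
```

When a reader goes out of sync, the first record whose sentinel check fails is roughly where to start looking. The sentinels can be compiled out of release builds if the space matters, as long as that choice is recorded in the format version.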

In my versions of this scheme, calling the Serialize function with a null buffer makes it just return the size required. Internally, this is done in the first half of the function by separate but similar code to the actual data handling part. Not only can you allocate a sufficiently sized buffer after an initial NULL call, but you can also verify that the buffer advanced by the same amount predicted by the first half of the function. A lot of small errors are caught by this.
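
Roughly, in C, with a hypothetical Item struct and made-up function names. The u16 length prefix and native byte order are shortcuts to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration. */
typedef struct {
    uint32_t id;
    const char *name;    /* saved as a u16 length followed by the bytes */
} Item;

/* With buf == NULL, returns the bytes required.
   Otherwise writes into buf and returns the bytes written. */
static size_t item_serialize(uint8_t *buf, const Item *it) {
    size_t name_len, n = 0;
    uint16_t len16;
    assert(it != NULL && it->name != NULL);
    name_len = strlen(it->name);
    assert(name_len <= UINT16_MAX);

    /* First half: size calculation only. */
    if (buf == NULL) {
        size_t need = 0;
        need += 4;           /* id */
        need += 2;           /* name length */
        need += name_len;    /* name bytes */
        return need;
    }

    /* Second half: the actual writes (native byte order, for brevity). */
    memcpy(buf + n, &it->id, 4);
    n += 4;
    len16 = (uint16_t)name_len;
    memcpy(buf + n, &len16, 2);
    n += 2;
    memcpy(buf + n, it->name, name_len);
    n += name_len;
    return n;
}

/* Size it, allocate, write, then cross-check the two halves.
   Returns a malloc'd buffer the caller must free, or NULL on failure. */
static uint8_t *item_save(const Item *it, size_t *out_size) {
    size_t need = item_serialize(NULL, it);
    uint8_t *buf = malloc(need);
    size_t wrote;
    if (buf == NULL) return NULL;
    wrote = item_serialize(buf, it);
    assert(wrote == need);   /* catches drift between the two halves */
    *out_size = wrote;
    return buf;
}
```

The assert in item_save is the whole point: add a field to one half and forget the other, and it fires the first time you save.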

It's not so tedious with the right set of helper functions.


Another option is to tag every field with some fixed ID unique within that struct. Files created by other versions of the program can be read by just ignoring unknown fields. All fields can be pre-initialized with sane defaults, if such a scheme works for your data, for the case of missing fields in the serialized data. Alternately, your deserialization function can analyze which fields are and aren't present and make logical decisions about how to handle the circumstances.
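
A bare-bones sketch of the tagged approach in C. The Box struct and tag values are invented for the example. One assumption beyond what's described above: each field also carries a one-byte payload length, which is one simple way to let a reader skip a field it doesn't recognize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values, assigned by hand. Tag 3 was retired and is never reused. */
#define TAG_WIDTH  1
#define TAG_HEIGHT 2
#define TAG_DEPTH  4

typedef struct { uint32_t width, height, depth; } Box;

/* Each field on disk: tag (1 byte), payload length (1 byte), payload. */
static size_t put_field_u32(uint8_t *buf, uint8_t tag, uint32_t v) {
    buf[0] = tag;
    buf[1] = 4;
    for (int i = 0; i < 4; i++) buf[2 + i] = (uint8_t)(v >> (8 * i));
    return 6;
}

static uint32_t get_u32(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Returns bytes appended. */
static size_t box_serialize(uint8_t *buf, const Box *b) {
    size_t n = 0;
    assert(buf != NULL && b != NULL);
    n += put_field_u32(buf + n, TAG_WIDTH, b->width);
    n += put_field_u32(buf + n, TAG_HEIGHT, b->height);
    n += put_field_u32(buf + n, TAG_DEPTH, b->depth);
    return n;
}

/* Returns 1 on success, 0 on truncated input. Missing fields keep their
   defaults; unknown tags are skipped using the length byte. */
static int box_deserialize(const uint8_t *buf, size_t size, Box *b) {
    size_t pos = 0;
    assert(buf != NULL && b != NULL);
    b->width = 1;            /* sane defaults for fields the file lacks */
    b->height = 1;
    b->depth = 1;
    while (pos < size) {
        uint8_t tag, len;
        if (size - pos < 2) return 0;
        tag = buf[pos];
        len = buf[pos + 1];
        pos += 2;
        if (size - pos < len) return 0;
        switch (tag) {
        case TAG_WIDTH:  if (len == 4) b->width  = get_u32(buf + pos); break;
        case TAG_HEIGHT: if (len == 4) b->height = get_u32(buf + pos); break;
        case TAG_DEPTH:  if (len == 4) b->depth  = get_u32(buf + pos); break;
        default: break;      /* written by some other version: ignore */
        }
        pos += len;
    }
    return 1;
}
```

Three u32 fields take 18 bytes here instead of 12, which is the overhead problem in miniature.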

Downsides are that the tags, even if just one byte each, consume quite a bit of space in practice. Perhaps this doesn't matter much if you're running the results through a compression algorithm afterward. You also have to keep track of the unique tag values somehow and make sure that retired ones aren't reused. You can't just use the byte offset or an enum for it; this has to be manually configured and maintained.