Bayle Shanks's website: proj-plbook-plChDataLangs

https://arp242.net/weblog/json_as_configuration_files-_please_dont
"I don't aversion to braces. Rather, my issues with JSON is that it doesn't have comments and that you cannot use a optional trailing comma." [5]
"JSON is slow. JSONB was created because JSON was slow. JSONB is still slow. Object (data) serialization has always been of interest to protocols. When one system must communicate data to another, both systems must agree on a format for transmission. While JSON is naturally debuggable, it does not foster agreement (specifically on numeric values) and it is truly abysmal on the performance side. This is why so many protocol serializations exist today." -- https://www.circonus.com/2017/11/some-like-it-flat/
http://seriot.ch/parsing_json.php or http://seriot.ch/projects/parsing_json.html
(for config): "I have to say that this is the best approach. Just take JSON and let the user generate it using whatever tool or language they prefer. I can see some users being annoyed if they are maintaining the config by hand but at least for me that isn’t an issue I have." -- [6]

JSON extensions and variants

JSON-LD
JSON Pointer
- https://tools.ietf.org/html/rfc6901
- http://susanpotter.net/blogs/software/2011/07/why-json-pointer-falls-short/
JSONB
BIPF
Smile
https://json5.org/
https://romefrontend.dev/#rome-json
https://hjson.github.io/
- HJSON opinions:
- https://dystroy.org/blog/hjson-in-broot/
JWCC: JSON With Commas and Comments
JSON-HAL
A Benchmark of JSON-compatible Binary Serialization Specifications
IJSON subset
https://jsonlines.org/
json-rpc

JCOF

https://github.com/mortie/jcof

" JCOF tries to be a drop-in replacement for JSON, with most of the same semantics, but with a much more compact representation of objects. The main way it does this is to introduce a string table at the beginning of the object, and then replace all strings with indexes into that string table. It also employs a few extra tricks to make objects as small as possible, without losing the most important benefits of JSON. Most importantly, it remains a text-based, schemaless format.

The following JSON object:

{ "people": [ {"first-name": "Bob", "age": 32, "occupation": "Plumber", "full-time": true}, {"first-name": "Alice", "age": 28, "occupation": "Programmer", "full-time": true}, {"first-name": "Bernard", "age": 36, "occupation": null, "full-time": null}, {"first-name": "El", "age": 57, "occupation": "Programmer", "full-time": false} ] }

could be represented as the following JCOF object:

Programmer;"age""first-name""full-time""occupation"; {"people"[(0,iw"Bob"b"Plumber")(0,is"Alice"b,s0)(0,iA"Bernard"n,n)(0,iV"El"B,s0)]} "

Discussions:

https://lobste.rs/s/5edgkf/i_made_jcof_object_notation_which_encodes

ASN.1

https://en.m.wikipedia.org/wiki/Abstract_Syntax_Notation_One

Older format. Considered 'heavyweight', unpopular today.

"A while ago, I started trying to write a Rust implementation of an ASN.1 library. Just parsing the message description format is a nightmare, since it has so many features, additions, and modifications over time. You can have generic types in your serialization format, which is kind of neat, but introduces even more complexity.

There is some stuff in there that is good (like the Canonical Encoding Rules which let you let you have a 1:1 binary mapping to a given piece of data, critical for crypto), but for general server-to-server RPC, it is more complicated than is generally needed." -- https://www.reddit.com/r/rust/comments/daja9b/binary_format_shootout_capn_protoflatbuffers_and/f22o7de/

Messagepack

https://github.com/msgpack/msgpack/blob/master/spec.md

Opinions:

https://news.ycombinator.com/item?id=21647208

vs JSON:

https://github.com/ludocode/msgpack-tools#differences-between-messagepack-and-json

Variants and extensions:

https://github.com/msgpack-rpc/msgpack-rpc/blob/master/spec.md

BSON

Muon

https://github.com/vshymanskyy/muon https://docs.google.com/presentation/d/1MosK6LTy_Rr32eF6HKej6UEtf9vBzdbeSF6YPb1_e4A/present#slide=id.g13de7db3282_28_0 https://lobste.rs/s/a7ougq/on_compact_simple_binary_encoding_on_par

EDN

https://github.com/edn-format/edn

Not "a system for representing objects - there are no reference types, nor should a consumer have an expectation that two equivalent elements in some body of edn will yield distinct object identities when read" [7].

types:

nil
boolean
string
character
symbol ("Symbols are used to represent identifiers, and should map to something other than strings, if possible.")
keywords ("Keywords are identifiers that typically designate themselves. They are semantically akin to enumeration values.")
integer (64-bit, signed)
floating point (64-bit)
list (hetero)
vector (hetero) (difference from lists: a vector "supports random access")
map (keys and values can be elements of any type)
set (hetero)

extensibility through 'tags': a tag starts with '#', and indicates the semantics of the following element. E.g.:

#myapp/Person {:first "Fred" :last "Mertz"}

"Upon encountering a tag, the reader will first read the next element (which may itself be or comprise other tagged elements), then pass the result to the corresponding handler for further interpretation, and the result of the handler will be the data value yielded by the tag + tagged element, i.e. reading a tag and tagged element yields one value."

built-in tagged elements:

#inst: date (RFC-3339, e.g. #inst "1985-04-12T23:20:50.52Z")
#uuid

comments: semicolon denotes a comment to-end-of-line

discard: #_ "indicates that the next element should be read and discarded...(note that the next element must still be a readable element...)"

Links:

(see also Datalog)

Datalog

Implementation:

https://github.com/nickelsworth/sympas/blob/master/text/16-datalog.org

Fressian

https://github.com/Datomic/fressian

Protobuf (Protocol Buffers)

Supported by Google.

Version 3 drops required fields and adds maps.

Primitive types: bool, 32/64-bit integers, float, double, string. Composite types: none, but repeated elements, and maps in v3. [8]

Resilient to malicious input [9].

New fields ignored by old binaries. [10]

Does not support random access reads without parsing [11].

"there were three undesirable issues with Google Protobuf: (1) IRONdb is in C and while the C++ support is good, the C support is atrocious, (2) it conflates protocol with encoding so it becomes burdensome to not adopt gRPC, and (3) it’s actually pretty slow." -- https://www.circonus.com/2017/11/some-like-it-flat/

Opinions:

https://reasonablypolymorphic.com/blog/protos-are-wrong/index.html
- "No Compositionality...Protobuffers offer several “features”, but none of them...work with one another"
- "Fields with scalar types are always present...this inability to distinguish between unset and default values is a nightmare."
- the API for non-scalar types distinguishes between unset and default values, but erases this distinction upon copying, and the way to make a copy while remembering it cannot be genericized without macros.
- the API for oneof types allows them to be accidentally erased by copying the wrong way
https://news.ycombinator.com/item?id=26934514
https://news.ycombinator.com/item?id=26939086 vs capnproto vs flatbuffers
https://news.ycombinator.com/item?id=26938024 vs arrow

Alternate RPC implementations:

Links:

MessagePack

Bencode

" Indeed, json maps are not supposed to be ordered so doing anything that depends on the order is bound to fail.

This is the exact reason bencode (https://en.wikipedia.org/wiki/Bencode) was invented, and I still believe we can replace all uses of json by bencode and be better off it, because it solves all too common issues:

bencoding maps are in lexicographical order of the keys, so no confusion possible for hashing/signing (a torrent id is the hash of a bencoding map)
bencoding is binary friendly, in fact it must be because it stores the pieces hashes of the torrent

" -- [12]

CBOR

https://cbor.io/

Discussion:

"...CBOR's tagging system where you can attach a tag to otherwise mundane data to hint otherwise, which is in my opinion one of the biggest idea behind CBOR and why CBOR is not really a ripoff of MessagePack?", [13] referencing https://github.com/kriszyp/cbor-records
https://raw.githubusercontent.com/intarchboard/e-impact-workshop-public/main/papers/Moran-Birkholz-Bormann_Sustainability-considerations-for-networking-equipment.pdf.pdf

Thrift

Primitive types: bool, byte, 16/32/64-bit integers, double, string. Composite types: list<t1>, set<t1>, map<t1, t2>. [14]

Avro

Schema is expected to always be available with the data (sender and receiver must share exactly the same schema; but it could be sent with the data). No code generation.

Schema in JSON. Example schema fragment:

{ "type": "record", "name": "BankDepositMsg", "fields" :
  [
    {"name": "user_id", "type": "int"},
    {"name": "amount", "type": "double", "default": "0.00"},
    {"name": "datestamp", "type": "long"} ]
}

Primitive types: null, boolean, int, long, float, double, bytes, string. Composite types: records, enums, arrays, maps, unions, fixed. [15]
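
A minimal sketch of why the schema has to be available to the reader: in Avro's binary encoding a record is just its field values in schema order, with no field names or tags on the wire. The sketch assumes the BankDepositMsg schema above and the encodings from the Avro specification (int and long as zigzag varints, double as 8 little-endian bytes). The function names are illustrative, not a library API.

```python
import struct


def zigzag_varint(n):
    """Encode a signed 64-bit integer as an Avro int/long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_bank_deposit(user_id, amount, datestamp):
    """Concatenate field values in schema order; nothing else is written."""
    return (
        zigzag_varint(user_id)          # "user_id": int
        + struct.pack("<d", amount)     # "amount": double
        + zigzag_varint(datestamp)      # "datestamp": long
    )


payload = encode_bank_deposit(1, 0.0, 64)
```

Without the schema, the 11 bytes in payload cannot be split back into fields, which is the trade-off against tagged encodings such as Protobuf and Thrift.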

Cap’n Proto

Sort-of successor to Protobuf: Kenton Varda, the author of Cap'n Proto, was the primary author of Protocol Buffers version 2, the rewrite of Protobuf that Google released as open source [16].

Zero-copy encoding.

Resilient to malicious input [17].

Supports random access reads without parsing [18].

New fields ignored by old binaries. Unknown fields retained. [19]

"Iirc Cap'n Proto was the only protocol which is intended for mutually distrusting client and server, and considers the RPC mechanism fairly central to it's value statement. I think the Flatbuffers equivalent to the latter is gRPC, but I don't think it or SBE have an answer to the former. That said I assume any implementation of Flatbuffers or SBE could be checking for malicious/malformed input." -- https://www.reddit.com/r/rust/comments/daja9b/binary_format_shootout_capn_protoflatbuffers_and/

"Cap'n proto is amazing, but sadly does not get nearly the support that protobuf/grpc does. The crate is updated less, and, more importantly, generates far more complex code - working with capnproto in rust, today, is extremely unpleasant with not a lot of examples or docs out there." -- [20]

"To avoid memory allocation with each iteration, you should pass a scratch buffer to MallocMessageBuilder?'s constructor. The scratch buffer can be allocated once outside the loop, but you need to create a new MallocMessageBuilder? each time around the loop." -- https://stackoverflow.com/a/61370743

"To serialize new messages, Cap’n Proto uses a “builder” object. This builder allocates memory on the heap to hold the message content, but because builders can’t be re-used, we have to allocate a new buffer for every single message. I was able to work around this with a special builder that could re-use the buffer, but it required reading through Cap’n Proto’s benchmarks to find an example, and used std::mem::transmute to bypass Rust’s borrow checker." -- [21]

Links:

FlatBuffers

https://google.github.io/flatbuffers/index.html https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html

Zero-copy encoding.

New fields ignored by old binaries. Unknown fields cannot be copied. [22]

Supports random access reads without parsing [23].

"Protocol Buffers is indeed relatively similar to FlatBuffers?, with the primary difference being that FlatBuffers? does not need a parsing/ unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation." -- [24]

" FlatBuffers?, like Protobuf, has the ability to leave out arbitrary fields from a table (better yet, it will automatically leave them out if they happen to have the default value). Experience with ProtoBuf? has shown me that as data evolves, cases where lots of field get added/removed, or where a lot of fields are essentially empty are extremely common, making this a vital feature for efficiency, and for allowing people to add fields without instantly bloating things.

Cap'n'Proto doesn't have this functionality, and instead just writes zeroes for these fields, taking up space. They recommend using an additional zero-compression on top of the resulting buffer, but this kills the ability to read data without first unpacking it, which is a big selling point of both FlatBuffers? and Cap'n'Proto. So Cap'n'Proto is either not truly "zero-parse", or it is rather inefficient/inflexible, take your pick. " -- [25]

"A major caveat of flatbuffers is that it stables a large vtable onto the front of every message, which can have unpleasant effects on size if your messages are small. It's not really intended for networking, but rather for large data files that repeat with many repeated records in them that can share a single vtable." -- https://www.reddit.com/r/rust/comments/daja9b/binary_format_shootout_capn_protoflatbuffers_and/f22p0e8/

"Simply specify the data encoding you are sending with a ‘Content-Type: x-circonus-<datatype>-flatbuffer’ header and the remote end can even avoid copying or even parsing any memory; it just accesses it. The integration into our C code is very macro-oriented and simple to understand." -- https://www.circonus.com/2017/11/some-like-it-flat/

Q: "What do you think is the best buffer protocol to use for multiplayer games? We used Protobuf for a fast-paced .io game, but encoding-decoding turned out to be pretty slow and generated a lot of garbage in JS. We were in the process to switch to FlatBuffers? (before the company went bankrupt), but the syntax made it feel harder to use compared to Protobuf, not sure about the performance though (we expected it to be faster and less garbage created because of the zero-copy).

So, would you recommend Protobuf, Cap'n'Proto, FlatBuffers? or FlexBuffers? for multiplayer games? The usual packets are game states or user input sent at high frequency. "

A: "I originally designed FlatBuffers? for games (though admittedly more for things like level data or save game data than network packets), so I'd think it is pretty suitable. I had actually used Protobuf on a game project just before, and its performance problems led directly to the no-unpacking no-allocation design that FlatBuffers? has.

So FlatBuffers? will make an incoming packet waaay faster to work with than Protobuf. On the downside, Protobuf tends to be a little smaller, so if bandwidth is a greater concern than (de-)serialization speed, you might still prefer it. Additionally, receiving data over the network raises the question of how you handle packets that have been corrupted (or intentionally malformatted by an attacker), and in the case of FlatBuffers? you'd need to at least run the "verifier" over the packet before accessing it if you don't want your game servers to crash when this happens. That slows it down a little bit, but is still fast, i.e. still doesn't allocate etc.

Cap'n Proto will perform similarly, though does have the downside that all fields take space on the wire, regardless of whether they're set or not. So which is better depends on the kind of data you want to send and how it is likely to evolve.

Frankly, for the absolute highest performance (and lowest bandwidth) game networking you still need a custom encoding. " -- https://news.ycombinator.com/item?id=23591749

FlatBuffers claims to have a simpler encoding than Cap'n Proto, according to [26].

Links:

FlexBuffers

https://google.github.io/flatbuffers/flexbuffers.html

Part of Flatbuffers; schema-less.

"A type byte is made up of 2 components (see flexbuffers.h for exact values):

2 lower bits representing the bit-width of the child (8, 16, 32, 64). This is only used if the child is accessed over an offset, such as a child vector. It is ignored for inline types.
6 bits representing the actual type (see flexbuffers.h).

Thus, in this example 4 means 8 bit child (value 0, unused, since the value is in-line), type SL_INT (value 1). " -- https://google.github.io/flatbuffers/flatbuffers_internals.html
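
A minimal sketch of the type-byte layout quoted above, assuming the low 2 bits hold a width code (0, 1, 2, 3 for 8, 16, 32, 64 bits) and the upper 6 bits hold the type. The function names are hypothetical; the real constants live in flexbuffers.h.

```python
def pack_type_byte(type_id, width_bits):
    """Combine a 6-bit type id and a child bit-width into one byte."""
    width_code = {8: 0, 16: 1, 32: 2, 64: 3}[width_bits]
    if not 0 <= type_id < 64:
        raise ValueError("type id must fit in 6 bits")
    return (type_id << 2) | width_code


def unpack_type_byte(byte):
    """Return (type_id, child_width_in_bits) for a packed type byte."""
    return byte >> 2, 8 << (byte & 0b11)


# The quoted example: 4 means type 1 with an (unused) 8-bit child width.
example = unpack_type_byte(4)
```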

SBE

Simple Binary Encoding. Focus on low-latency.

Zero-copy encoding.

https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles

specifying message formats:

" SBE (Simple Binary Encoding) was designed by a former Protobuf developer for the Chicago Market Exchange group to be used for low-latency trading; its deserialization performance is therefore very much a central concern, even in the presence of forward/backward compatibility.

There's a fixed cost penalty in SBE to handle forward/backward compatibility. The format is simple:

    Messages have a dynamic size and contain first Fields and then lists of Groups.

    Groups have a dynamic size and contain Fields.

    Fields have a fixed size.

A new Field can be added at the end of the Fields of a Message or Group. A new list of Groups can be added at the end of a Message.

Detecting the presence (or absence) of a given Field in a Message or Group is done by comparing its fixed offset + fixed size to the dynamic size of the Message or Group: if it is less-than-or-equal, it is present, otherwise it is absent.

For the set of features supports (which notably excludes arbitrary nesting), SBE is the fastest protocol I've ever implemented.

If you need arbitrary nesting, however... well, you won't be able to use SBE. " -- [27]
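
The presence check described in the quote can be sketched as follows. This is a minimal sketch, not the SBE wire format: the field layout FIELDS_V2 and the function read_field are hypothetical, and a "block" here is just the fixed-size field area whose length the sender declares.

```python
import struct

# Hypothetical version-2 layout: name -> (fixed offset, struct format).
# "venue" was appended at the end in version 2; version 1 blocks lack it.
FIELDS_V2 = {
    "price": (0, "<q"),      # 8 bytes
    "quantity": (8, "<i"),   # 4 bytes
    "venue": (12, "<H"),     # 2 bytes, new in v2
}


def read_field(block, name):
    """Read a field in place, or return None if the block is too short."""
    offset, fmt = FIELDS_V2[name]
    size = struct.calcsize(fmt)
    # Present only if the whole field fits inside the sender's block.
    if offset + size <= len(block):
        return struct.unpack_from(fmt, block, offset)[0]
    return None


v1_block = struct.pack("<qi", 100, 5)        # written by an old sender
v2_block = struct.pack("<qiH", 100, 5, 7)    # written by a new sender
```

A reader built against the newer layout can consume both blocks: read_field(v1_block, "venue") yields None, while the same call on v2_block yields the stored value. The cost of compatibility is a single comparison per field.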

"...my intuition is that SBE will probably edge Cap’n Proto and FlatBuffers? on performance in the average case, due to its decision to forgo support for random access" -- [28]

"Simple Binary Encoding has the simplest encoding..." between Cap’n Proto, FlatBuffers?, SBE -- [29]

"Both Cap’n Proto and Flatbuffers use message offsets to handle variable-length data, unions, and various other features. In contrast, messages in SBE are essentially just structs; variable-length data is supported, but there’s no union type." -- https://speice.io/2019/09/binary-format-shootout.html

"After a series of internal bake-offs, we found that Simple Binary Encoding (SBE) performed comparably or better than other well-known formats (like Cap’n Proto and FlatBuffers?) in terms of encode and decode times. In particular, SBE stood out for its low timing variability." -- https://web.archive.org/web/20190427124806/https://polysync.io/blog/session-types-for-hearty-codecs/

Links:

https://speice.io/2019/09/binary-format-shootout.html

Transit

Core types: strings, booleans, integers (to 64 bits w/o truncation), floats, nil/null, bytearray, arrays, maps (with arbitrary scalar keys, not just strings).

Extension types: timestamps, UUIDs, URIs, arbitrary precision integers and decimals, symbols, keywords, characters, quoted values, sets, lists, arrays, hypermedia links, maps with composite keys. Also, custom extension types.

No reference types, nor identity. Not "a system for representing object graphs" [30].

An encoding on top of JSON or MessagePack.

Links:

YAML

Not Turing-complete

Popular as a configuration language.

YAML can encode cyclic data structures [31].

Indentation is significant.

"(YAML) goes to great lengths to provide human-friendly features, trading off computer-friendliness to an fairly extreme extent. Eliminating 'plain' scalars (unquoted strings-as-values), folded multiline literals, tags, anchors/aliases, and possibly directives, as a sort of reduced yaml would make the language a lot less silly for the kinds of things a lot of people end up using it for. " [32]

"is still the best 'JSON with better layout' out there, which is what we want when we want readable documents and maintainable code. EDN, Transit and TOML all fail in that somewhere, IMO, and of course popularity is actually a hugely important feature for any data interchange format." -- [33]

YAML is supposed to be a superset of JSON: "YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model. This is also the case in practice; every JSON file is also a valid YAML file. This makes it easy to migrate from JSON to YAML if/when the additional features are required." -- YAML: Relation to JSON. However, it misses the mark a little: "Please note that YAML has hardcoded limits on (simple) object key lengths that JSON doesn't have and also has different and incompatible unicode character escape syntax... YAML also does not allow \/ sequences in strings" -- [34] via [35]

Opinions/gotchas:

https://arp242.net/weblog/yaml_probably_not_so_great_after_all.html
"I think the core was brilliantly designed. If you put two hierarchical documents side by side - one in TOML and another in YAML the YAML one is much, much clearer and cleaner." -- [36]
nodes/anchors can be used to "...take repetitive YAML and make it DRY. That appears to be how it is used, e.g. in gitlab's ci YAML."
"...implicit types can 'cause surprise type conversions'. Otoh YAML has traditionally been used as the basis of higher-level configuration files for particular applications. What I'm saying is that implicit typing should be permitted, but delegated to those applications. Conversely, I'm not saying that StrictYAML should do anything by default with unquoted values, except reporting them to the application as being an unquoted value. This way the application could choose to process the value differently from those that are quoted." [37]
"The implicit typing rules (ie, unquoted values) should have been application dependent." [38]
https://arp242.net/yaml-config.html
"...certain strings (like 'on') need quoting if you don't want it to interpret them as something else."
"...some YAML footguns like the country code for Norway being interpreted as a boolean are reasonably famous, and I think widely regarded as a bad idea."
https://john-millikin.com/json-is-not-a-yaml-subset
- https://lobste.rs/s/equcp2/json_is_not_yaml_subset
https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell
https://news.ycombinator.com/item?id=34351503
https://lobste.rs/s/nsymer/yaml_document_from_hell
" There’s a misunderstanding about YAML’s complexity, it comes up often, and it’s in the first paragraph here. YAML aimed to be a friendly alternative to XML. It’s not 1:1 with XML but kinda wanted to support at least as much. JSON didn’t quite exist yet, though I recognize that later versions of YAML have tried to do a “superset of JSON” thing, but I’ve basically only ever heard the creators mention that at all. YAML was aiming for highly expressive, readable, and round-trip native serialization, so it has this broad set of features and misfeatures. You can easily represent a memory cycle with YAML pointers; you can use event-based parsing and have several distinct documents in a stream and that’s part of the core spec. JSON is pleasantly minimal in comparison but it never wanted to support all of that. But no one looked at the tiny JSON spec, decided it didn’t have enough multiline string options or obscure hash-in-array-or-was-it-array-in-hash whitespace quirks, and sketched up YAML over it. And also, outside of core JSON, it’s grown competing specs for chained documents (json-seq, json-lines), or standards to allow comments, or commas, or to tag native data types for round-trip serialization. We couldn’t leave well enough alone! It could’ve been so simple… Both are terrible for config files. "-- https://lobste.rs/s/nsymer/yaml_document_from_hell#c_llgodm
https://changelog.com/posts/xml-better-than-yaml
"Ease of reasoning about configuration file formats is vastly more important than conveniences for writing specific values. Implicit conversion among types beyond very basic lifting of integer types is a bad idea, especially for configuration file formats. Grammars for configuration file formats should be simple enough to write a complete, correct grammar as a one day project." -- https://lobste.rs/s/fue8mw/xml_is_better_than_yaml#c_9r5zru
"The best feature of Yaml is that you can ignore it and use JSON instead." (since 1.2, apparently: https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_2pqbiv ) -- [39]
https://noyaml.com/
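
The implicit-typing gotchas quoted above ('on', the country code for Norway) can be illustrated with a small sketch. It assumes the boolean forms listed in the YAML 1.1 bool type definition; it is not a YAML parser, individual libraries differ in which forms they accept, and the function names are hypothetical.

```python
import re

# Plain (unquoted) scalars that the YAML 1.1 bool type treats as booleans.
YAML11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)
TRUTHY = {"y", "yes", "true", "on"}


def resolve_scalar(text, quoted=False):
    """Resolve a scalar the way an implicit-typing loader might."""
    if quoted:
        return text  # quoting opts out of implicit typing
    if YAML11_BOOL.fullmatch(text):
        return text.lower() in TRUTHY
    return text


countries = [resolve_scalar(code) for code in ["SE", "NO", "DK"]]
quoted_norway = resolve_scalar("NO", quoted=True)
```

Here the unquoted list comes back as a mix of strings and a boolean, while the quoted value stays a string, which is the behavior the opinions above argue should be left to the application.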

Retrospectives:

https://news.ycombinator.com/item?id=17361188 and https://news.ycombinator.com/item?id=17358548 and https://news.ycombinator.com/item?id=17359309

StrictYAML

Simpler than YAML:

https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst

Removes:

Implicit typing
Binary data
Explicit tags
Node anchors and refs
Flow style
Duplicate keys

Opinions:

https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst Compares StrictYAML to TOML, JSON5, HJSON, and SDLang.
"Thank you for StrictYAML I might just use it. It does look like a nice hair cut. You might wish to give Ingy a ring. He has been itching to move forward on a reduced/secure YAML subset. That said, StrictYAML seems to be a tad bit more of a hair cut than I'd imagine. I'd keep nodes/anchors, since I think a graph storage model is underrated; I think that data processing techniques just haven't caught up with graph structures. Further, I'm not sure everything can be easily typed based upon a schema. Hence, I'm not sure about completely dropping implicit types, perhaps you may want to provide a way for applications to resolve them if they wish. For example, an application may want to attempt to treat anything starting with "[" or "{" as JSON sub-tree. Perhaps keeping "!tag" but handing it off to the application to resolve might also be a good idea in this regard. Even so, typing should be done at the application level and default to something very boring. " [40]
https://hitchdev.com/strictyaml/why/implicit-typing-removed/
- discussion: https://lobste.rs/s/xa3gs7/norway_problem
"StrictYAML? is one toke over the line when they eliminated inline flow collections. YAML should be a superset of JSON..." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_onuwbh
"Yes, and it’s great if you’re in Python and dealing with YAML, but it is not its own spec, so you have to count on using that specific library and language." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_hpvzht

TOML

https://github.com/toml-lang/toml

"essentially an extended version of INI which allows the expression of both hierarchical and typed data." [41]

Discussion:

https://news.ycombinator.com/item?id=17513770
"TOML is good for data layed out with TOML. Representing arbitrary nested arrays and tables gets messy. Also, the constraint on homogenous shallow types has impacted me in some cases." -- [42]
"TOML is almost perfect. The only things I don't like are the need for commas and the double brackets." -- [43]
TOML vs YAML example
https://dystroy.org/blog/hjson-in-broot/
"It seems to lack the features that are the reason that I like UCL, specifically well-defined composition. UCL defines include semantics with the option to add, replace, or remove nodes from the prior tree. This makes it easy to split configuration into multiple files and provides things like default, overridden by machine-generated files, overridden by human-edited files. You can do the config.d-style thing very easily, including every file in a directory and having each one define a set of nodes that it adds or replaces. With TOML, as far as I can see, you’re also stuck with a single config file, which is fine for simple programs but very painful for something with large config files. You need to build this on top and that means that you lose the ability to treat the config files as generic. " -- https://lobste.rs/s/mkcjiz/toml_tom_s_obvious_minimal_language#c_6mewzt

HCL

https://github.com/hashicorp/hcl2

UCL

https://github.com/vstakhov/libucl

Opinions:

" I wish more things would adopt UCL for configuration. Like YAML, it is a representation of the JSON object model but it also has a number of features that make it more useful as a configuration language: Macros. Include files. Explicit merging rules for includes (replace objects, add properties to objects). Cryptographic signing of includes, so you can use semi-trusted transports for them. Syntactic sugar for units " -- [44]

"FreeBSD? meanwhile has standardized on UCL which has the nice property of being a superset of JSON while also allowing nginx-like friendly syntax." -- [45]

UCG

https://ucg.marzhillstudios.com/

SAN

https://news.ycombinator.com/item?id=18023105

S-expressions (sexps)

Not Turing-complete (note: Lisp source code is written as S-expressions, but the language's evaluation semantics are built on top of them).

Represents trees.

"Canonical" S-expressions (csexps)

example:

(4:this22:Canonical S-expression3:has1:55:atoms)

"a binary encoding form of a subset of general S-expression (or sexp)...The particular subset of general S-expressions applicable here is composed of atoms, which are byte strings, and parentheses used to delimit lists or sub-lists. These S-expressions are fully recursive. ... While S-expressions are typically encoded as text, with spaces delimiting atoms and quotation marks used to surround atoms that contain spaces, when using the canonical encoding each atom is encoded as a length-prefixed byte string. No whitespace separating adjacent elements in a list is permitted. The length of an atom is expressed as an ASCII decimal number followed by a ":". ... A csexp includes a non-S-expression construct for indicating the encoding of a string, when that encoding is not obvious. Any atom in csexp can be prefixed by a single atom in square brackets – such as "[4:JPEG]" or "[24:text/plain;charset=utf-8]". " [46]

pros:

Uniqueness
Support for binary data: Atoms can be any binary string
Support for type-tagging encoded information

Links:

https://en.m.wikipedia.org/wiki/Canonical_S-expressions

Comparisons

"The FlatBuffers? encoding is based on vtables and is relatively straightforward (the runtime library is tiny). This also means it's inefficient for small messages, but in my testing its vtable deduplication worked great for my use case (~100k messages of the same type per memory-mapped file), in that the vtable overhead tends quickly to zero. Cap'n Proto has a more complex encoding that is probably more efficient in terms of wire size, and particularly for small/standalone messages, but the runtime is larger as a result." [47]
Blog post by the Cap'n Proto guy comparing Cap'n Proto, Flatbuffers, Protobuf, and SBE: https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html
"Thrift to me is a Protobuf clone. Very similar design. Early on it was not very well optimized compared to protobuf, but I imagine they've fixed that by now. The big advantage they had was that they included an RPC system in their first release -- although it was a FIFO RPC system which struck me as an odd design choice (I think they may have fixed this more recently?). But now GRPC exists, so I don't think there's much reason to choose Thrift over Protobuf/GRPC. However, I am obviously very biased. :)" [48] (writer was a designer of Protobuf v2 and Cap'n Proto)
"Cap'n'Proto promises to reduce Protocol Buffers much like FlatBuffers? does, though with a more complicated binary encoding and less flexibility (no optional fields to allow deprecating fields or serializing with missing fields for which defaults exist). It currently also isn't fully cross-platform portable (lack of VS support). msgpack: has very minimal forwards/backwards compatibility support when used with the typed C++ interface. Also lacks VS2010 support. Thrift: very similar to Protocol Buffers, but appears to be less efficient, and have more dependencies. YAML: a superset of JSON and otherwise very similar. Used by e.g. Unity." [49]
some benchmarks at http://google.github.io/flatbuffers/md__benchmarks.html
"Protocol Buffers can be used in the widest array of languages, followed by Apache Thrift, followed by Apache Avro" [50]
Avro vs (Thrift or Protobuf): "On the wire/spindle, one of the differences between Avro and Thrift (or PB) is that Avro requires that the schema is always attached (in some way) to the data. For example, let's say you have two schemas, A and B (which, for sake of example, are not related to each other). In Thrift, clientA could open up a socket to serverB, and start chatting away. Behavior is undefined. In Avro, clientA and serverB would try to negotiate a schema, and it would turn out that they're not compatbile. Avro would error out. Analogous stuff happens in Avro Data Files: the schema is stored in the file, so it's impossible to read a B-record from an A-file. Having the schema around let's Avro encode data a little bit differently. In Thrift/PB, the data is encoded as sequences of 3-tuples, like so: (type enum, field id integer, data bytes). In this way Thrift/PB carry around enough schema to parse any Thrift/PB byte stream, though not necessarily know the original fields, or distinctions between signed and unsigned. In Avro, since the schema is available, you can just have a sequence of data bytes. (For example, if your schema has two fields, both integers, in Avro, you'd just see two integers, appropriately encoded, but in Thrift/PB, there'd be bits for the field indexes.) For many use cases, this makes Avro more compact, but it turns out to be less compact for optional-heavy structures. (Say you have a structure where each record has exactly one of seventeen different embedded log records. In PB/Thrift, you'd model that as a record with seventeen optional fields, and only one of them would have its tag and data represented. In Avro, you'd have 16 null indicators as well as the original data. To get the bytes back, you model it as a union in Avro, but then you get the index indicators into the union back...)" [51]
comparison of Protobuf, Thrift, and Avro: http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro
Schema evolution in Avro, Protocol Buffers and Thrift: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
EDN, Fressian, Transit:

" edn is the best choice for human-readable data. It is however, less efficient to transmit and depends on writing a high-performance parser - this is a high bar in some language environments. edn is most attractive right now to Clojure users b/c of its proximity to Clojure itself. While it has many advantages as an extensible readable literal data format, it's an uphill battle to sell that against other data formats that already have greater mindshare and tooling in other language communities.

fressian is the highest performance option - it takes full advantage of a number of compression tricks and has support for arbitrary user-extensible caching. Again, it requires a fair amount of effort to write a fressian library though so it's probably best for JVM-oriented endpoints right now. By seeking greatest performance, fressian also makes tradeoffs that narrow its range of use and interest group.

transit is a pragmatic midpoint between these two. It focuses, like fressian, on program-to-program data transmission however, the data can be made readable (like edn) via the json-verbose mode. Like fressian, transit contains caching capabilities but they are more limited and not user-extensible. transit is designed primarily to have the most high-quality implementations per lowest effort - effectively shooting for greater reach than either edn or fressian by lower the bar to implementation. The bar is lowered by reusing high-performance parser for either JSON or messagepack which exist in a large number of languages. Of particular importance is leveraging the very high performance JSON parsers available in JavaScript runtimes, making transit viable as a browser-side endpoint for a fraction of the effort required to write a high performance edn or fressian endpoint. As transit explicitly seeks reach and portability, it is naturally the format with the broadest potential usage. "

-- https://groups.google.com/forum/#!topic/clojure/9ESqyT6G5nU

"Transit sounds like an evolution of EDN and Fressian: make the bottom layer pluggable to support human-readable/browser-friendly JSON or use the well-established msgpack for compactness. Caching is still there, but it can only be used for keywords/strings/symbols/etc. instead of arbitrary values like Fressian -- probably a good trade-off for simplicity." [52]

Transit vs MessagePack: " The biggest difference is that MessagePack extensibility (which is not yet widely implemented) is based upon binary blobs, whereas Transit defines extensions in terms of other Transit types. Also, Transit can reach the browser via JSON. And Transit has caching... " [53]
Transit vs MessagePack (and MessagePack vs JSON): "MessagePack implementations in JavaScript get trounced by JSON for read/write performance and JavaScript is a pretty important part of the puzzle for many people building systems these days. Transit on the other hand can best JSON on more recent JS engines and also in a bind I'd rather debug Transit verbose JSON output than MessagePack :) " [54]
Transit vs EDN: "Transit advantages over EDN are mostly those of performance and scope that fall out of using JSON and MessagePack for the underlying serialization format....EDN still has utility in Clojure/ClojureScript programs - it's more natural i.e. configuration. However for communication between disparate systems Transit has many advantages." [55]
MessagePack vs Transit: "MsgPack - limited data types (no URLs, Dates, etc) " [56]
Transit vs binary formats: "If you only use a binary data format to convey values in your heterogenous system, you are unlikely to be the target consumer of Transit. However, if some components of your system marshal JSON and you would prefer something comparable in performance to JSON when communicating with those components, then Transit is well worth thinking long and hard about." [57]
discussion including poll: https://github.com/apex/up/issues/83 (poll currently has, in order: JSON, YAML, TOML, HCL)
discussion: https://lobste.rs/s/dn91bz/why_broot_is_switching_from_toml_hjson_for
https://nathanleclaire.com/blog/2016/06/13/yaml-hcl-toml-and-other-fantastic-beasts/
- https://lobste.rs/s/4ta50f/yaml_hcl_toml_other_fantastic_beasts
https://kevin.burke.dev/kevin/more-comment-preserving-configuration-parsers/
https://www.zionandzion.com/json-vs-xml-vs-toml-vs-cson-vs-yaml/
https://users.rust-lang.org/t/why-does-cargo-use-toml/3577/13
https://wiki.alopex.li/BetterThanJson
- concludes that the following are the best of breed: JSON, Protobuf, Cap’n Proto, Flatbuffers, CBOR, msgpack

Moon

https://github.com/jordanorelli/moon

BayesDB

links:

http://probcomp.csail.mit.edu/bayesdb/

Alan's data modeling language

See https://alan-platform.com/pages/tuts/getting-started.html

Dhall

"A configuration language guaranteed to terminate"

https://github.com/dhall-lang/dhall-lang

Opinions:

weakly recommended by [58]
"For example, Dhall (https://dhall-lang.org/) could be seen as a subset of Haskell, with less features but guaranteed termination!" [59]
good explanation of why these "configuration languages" are useful: https://www.saurabhnanda.in/2022/03/24/dhall-a-gateway-drug-to-haskell/
- https://lobste.rs/s/1jjk8u/dhall_gateway_drug_haskell
  - vs. CUE: "My impression after messing around in this space is that CUE is a better fit for the task of configuration" (compared to Dhall) -- [60]
"I like nickel. Dhall is too complex when many config engineers not only aren’t devs or dev minded, I work with a guy who recently stated in a team meeting “I have no interest in technology”. I’d like to think he’s an anomaly, but these days many people get into the industry just for the salary." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_axorpd

Dhall links

Nickel

https://www.tweag.io/blog/2022-03-11-nickel-first-release/

https://github.com/tweag/nickel/

https://github.com/tweag/nickel/#related-projects-and-inspirations

https://github.com/tweag/nickel/blob/master/RATIONALE.md

https://lobste.rs/s/hskr5v/first_release_nickel

Opinions:

"I like nickel. Dhall is too complex when many config engineers not only aren’t devs or dev minded, I work with a guy who recently stated in a team meeting “I have no interest in technology”. I’d like to think he’s an anomaly, but these days many people get into the industry just for the salary." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_axorpd

Typedefs

https://typedefs.com/

"programming language agnostic, algebraic data type definition language"

https://github.com/typedefs/typedefs/blob/master/TUTORIAL_UNDERSTANDING_TYPEDEFS.md

Cue

https://cuelang.org/

A schema language. Works with YAML and JSON.

Discussion/opinions:

https://news.ycombinator.com/item?id=20847943
Cue vs. Jsonnet: https://github.com/cuelang/cue/issues/33
"...replacing inheritance as the fundamental compositional primitive with constraint unification" -- [61]
https://news.ycombinator.com/item?id=20362951
"I've tried Cue, it gives you a type system for static JSON, but the support for functions seems either absent or very, very convoluted." [62]
vs. Dhall: "My impression after messing around in this space is that CUE is a better fit for the task of configuration" (compared to Dhall) -- [63]

Lua

Lua is not a data language but can be used as one:

Opinions:

https://boston.conman.org/2023/09/29.1
- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_6eolyt
"I use at least two great tools that use lua configs. I don’t like reading or writing those configs. I end up mangling the code so I can pretend it’s just a config format when I read it. As I will say in any YAML-alt post I notice, I want a config file to be less of a program, not more. So my favorite alternative by far is NestedText, which has no types beyond string, list, and dict, leaving code-concerns to real code (elsewhere)." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_kxgktn

Recfile

blog entry: https://labs.tomasino.org/gnu-recutils/

Jsonnet

https://jsonnet.org/

Discussion/opinions:

" I guess no one mentioned jsonnet yet, but it's probably the most popular thing in use (along with ksonnet). But really, it's the most awful of the lot - where you have to program using a strange bastardization of JSON (Hello XML nightmares of the 90s/00s). Try using that for a while, and you'll be begging to use dhall-lang." [64]
"Also Jsonnet, which I implemented into a large-scale project when a nest of YAML files became unwieldy. Was surprisingly delightful to work with." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_jgv19a
- "Can’t endorse Jsonnet (from experience - jsonnet is a huge source of complexity at my day job). Jsonnet is a huge PITA to debug, it’s easy to structure poorly because it’s dynamically typed, and to make itself basically JS it has it’s own standard library." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_hbeyju

Starlark

The configuration language used by Bazel (a dialect of Python).

Apache Arrow

https://arrow.apache.org/

"specifies a standardized language-independent columnar memory format for flat and hierarchical data,...also provides computational libraries and zero-copy streaming messaging and interprocess communication...the de-facto standard for columnar in-memory analytics...backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm"

" Arrow is based on Flatbuffers, Parquet is based on Thrift. " -- https://news.ycombinator.com/item?id=23594874

Ion

https://amzn.github.io/ion-docs/

Opinions:

Pup

https://github.com/liljencrantz/crush/blob/master/src/crush.proto

Oheap

http://www.oilshell.org/blog/2017/01/09.html http://www.oilshell.org/blog/2018/12/16.html#toc_3

ASDL

A language for describing ASTs. Software is available to autogenerate code that implements ASTs described by ASDL.

Used by CPython (to describe its AST) and by the Oil shell.

http://www.oilshell.org/blog/2016/12/11.html

http://www.oilshell.org/blog/2016/12/16.html

https://www.cs.princeton.edu/research/techreps/TR-554-97

https://raw.githubusercontent.com/eliben/asdl_parser/master/docs/ASDL%20Zephyr%20-%20Wang.pdf

http://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers

https://github.com/eliben/asdl_parser

http://asdl.sourceforge.net/

" Analogies for ASDL 2016-12-16

If you haven't used Google's protocol buffer serialization technology, this analogy may be helpful:

JavaScript Data Model : JSON :: C Data Model : Protocol Buffers

Just as JSON is a language-independent serialization format extracted from JavaScript's data model (objects, heterogeneous arrays, strings, numbers, booleans), protocol buffers are a mostly language-independent serialization format extracted from C's data model:

    structs (messages)
    homogeneous arrays (repeated fields)
    strings
    enums
    double and float
    unsigned and signed integers of various widths.

A similar analogy explains Zephyr ASDL, which I explained from a few other angles in the last post:

C data model : Protocol Buffers :: ML data model : ASDL

ML is the language that introduced algebraic data types or ADTs. ADTs are a characteristic feature of strongly-typed functional languages like Standard ML, OCaml, and Haskell.

ASDL, like protocol buffers, is a domain-specific language that describes a language-independent serialization format for a particular data model -- in this case, the ML data model. It has the following constructs:

    Product types, aka records, representable by structs in C and C++.
    Sum types, aka variants, representable by tagged unions in C or subclasses in C++.
    Optional fields, representable by a pointer that may be null. In ML-like languages, they're the Option type.
    Repeated fields, representable by arrays in C++. In ML-like languages, they're lists.
    Strings.
    Integers of unspecified width.

" -- http://www.oilshell.org/blog/2016/12/16.html

" ASTs (Abstract Syntax Trees) are an important data structure in compiler front-ends. If you've written a few parsers, you almost definitely ran into the need to describe the result of the parsing in terms of an AST. While the kinds of nodes such ASTs have and their structure is very specific to the source language, many commonalities come up. In other words, coding "yet another AST" gets really old after you've done it a few times.

Worry not, as you'd expect from the programmer crowd, this problem was "solved" by adding another level of abstraction. Yes, an abstraction over Abstract Syntax Trees, oh my! The abstraction here is some textual format (let's call it a DSL to sound smart) that describes what the AST looks like, along with machinery to auto-generate the code that implements this AST.

Most solutions in this domain are ad-hoc, but one that I've seen used more than once is ASDL - Abstract Syntax Definition Language. The self-description from the website sounds about right:

    The Zephyr Abstract Syntax Description Language (ASDL) is a language designed to describe the tree-like data structures in compilers. Its main goal is to provide a method for compiler components written in different languages to interoperate. ASDL makes it easier for applications written in a variety of programming languages to communicate complex recursive data structures.
    "
     -- https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers
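The ASDL constructs listed in the Oil blog quote above (product types, sum types, optional fields, repeated fields) can be approximated in Python with dataclasses. This is only a sketch of the data model, not output from any ASDL code generator; the node names `Num`, `BinOp`, and `Call` and the `evaluate` function are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Num:
    """Product type (record) with a single integer field."""
    value: int


@dataclass
class BinOp:
    """Product type whose fields are themselves expressions."""
    op: str
    left: "Expr"
    right: "Expr"


@dataclass
class Call:
    func: str
    args: List["Expr"] = field(default_factory=list)  # repeated field
    doc: Optional[str] = None                          # optional field


# Sum type (variant): an expression is exactly one of these alternatives.
Expr = Union[Num, BinOp, Call]


def evaluate(e: Expr) -> int:
    """Walk the tree, dispatching on which alternative of the sum type we hold."""
    if isinstance(e, Num):
        return e.value
    if isinstance(e, BinOp):
        left, right = evaluate(e.left), evaluate(e.right)
        if e.op == "+":
            return left + right
        if e.op == "*":
            return left * right
        raise ValueError(f"unknown operator {e.op!r}")
    if isinstance(e, Call):
        if e.func == "sum":
            return sum(evaluate(a) for a in e.args)
        raise ValueError(f"unknown function {e.func!r}")
    raise TypeError(f"not an Expr: {e!r}")


tree = BinOp("*", Num(2), Call("sum", [Num(1), Num(3)]))
print(evaluate(tree))  # prints 8
```

An ASDL tool would generate classes like these (plus serialization code) from a short grammar-like description, so the same tree shape can be shared between components written in different languages.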

Preserves

https://preserves.gitlab.io/preserves/preserves.html

typed-wire

https://github.com/typed-wire/typed-wire

ATD

https://github.com/ahrefs/atd https://atd.readthedocs.io/en/latest/

"Adaptable Type Definitions" -- "Static Types for Json APIs"

Spot

https://github.com/airtasker/spot

"Spot ("Single Point Of Truth") is a concise, developer-friendly way to describe your API contract."

"Leveraging the TypeScript syntax, it lets you describe your API and generate other API contract formats you need (OpenAPI, Swagger, JSON Schema)."

Opinions:

https://news.ycombinator.com/item?id=24996455

OpenAPI / Swagger

OpenAPI (version 2 was formerly known as Swagger)

Relational pipes

https://relational-pipes.globalcode.info/v_0/

Arrow

https://arrow.apache.org/

Opinions:

https://news.ycombinator.com/item?id=26938024 vs protobufs

SimpleSerialize (SSZ)

https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md https://github.com/ethereum/eth2.0-specs/blob/a63de3dc374148fe8adacd8718f67f8c7ba54f2e/specs/simple-serialize.md https://rauljordan.com/2019/07/02/go-lessons-from-writing-a-serialization-library-for-ethereum.html

RLP

https://eth.wiki/en/fundamentals/rlp https://github.com/ethereum/wiki/wiki/RLP

https://ethresear.ch/t/replacing-ssz-with-rlp-zip-and-sha256/5706

Kaitai

https://kaitai.io/ https://lobste.rs/s/pnfkzp/kaitai_struct_declarative_binary_format

NestedText

https://nestedtext.org/

Only strings, lists, dicts are supported.

Comparison to JSON, YAML, TOML, INI, CSV, TSV: https://nestedtext.org/en/latest/alternatives.html#yaml

Opinions:

" I don't like YAML and would like to move on, but I hope we don't move onto this. I think it's crazy that when I add a string to an inline list, I may need to convert that inline list to a list because this string needs different handling. I think it's crazy that "convert an inline list to a list" is a coherent statement, but that is the nomenclature that they chose. I don't like that a truncated document is a complete and valid document. But what is most unappealing is their whitespace handling. I couldn't even figure out how to encode a string with CR line endings. So, I downloaded their python client to see how it did it. Turns out, they couldn't figure it out either: >>> nt.loads(nt.dumps("\r"),top="str") '\n' " -- [65]
"I use at least two great tools that use lua configs. I don’t like reading or writing those configs. I end up mangling the code so I can pretend it’s just a config format when I read it. As I will say in any YAML-alt post I notice, I want a config file to be less of a program, not more. So my favorite alternative by far is NestedText, which has no types beyond string, list, and dict, leaving code-concerns to real code (elsewhere)." -- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua#c_kxgktn

ZSON / Zed

https://zed.brimdata.io/docs/formats/zson/

Data language concepts, patterns, and practices

https://en.m.wikipedia.org/wiki/Type-length-value
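A minimal sketch of the type-length-value pattern in Python. The field widths here (1-byte type, 2-byte big-endian length) and the function names are arbitrary choices for illustration; real TLV-based formats each pick their own widths and type codes.

```python
import struct

HEADER = struct.Struct(">BH")  # 1-byte type, 2-byte big-endian length


def tlv_encode(items):
    """Encode an iterable of (type, value_bytes) pairs into one byte string."""
    out = bytearray()
    for type_code, value in items:
        out += HEADER.pack(type_code, len(value))
        out += value
    return bytes(out)


def tlv_decode(buf):
    """Decode a byte string produced by tlv_encode back into (type, value) pairs."""
    items = []
    pos = 0
    while pos < len(buf):
        if pos + HEADER.size > len(buf):
            raise ValueError("truncated header")
        type_code, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        value = buf[pos:pos + length]
        if len(value) != length:
            raise ValueError("truncated value")
        items.append((type_code, value))
        pos += length
    return items


encoded = tlv_encode([(1, b"hi"), (2, b"")])
print(encoded)              # prints b'\x01\x00\x02hi\x02\x00\x00'
print(tlv_decode(encoded))  # prints [(1, b'hi'), (2, b'')]
```

Because every value carries its own length, a reader can skip over type codes it does not understand, which is the main reason the pattern shows up in extensible binary protocols.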

Data language links

Schema exchanges

https://schema.org/

Concise-encoding

https://concise-encoding.org/

discussion:

https://news.ycombinator.com/item?id=31475779

Server-Sent Events

" The protocol is very simple. It uses the text/event-stream Content-Type and messages of the form:

    data: First message

    event: join
    data: Second message. It has two
    data: lines, a custom event type and an id.
    id: 5

    : comment. Can be used as keep-alive

    data: Third message. I do not have more data.
    data: Please retry later.
    retry: 10

Each event is separated by two empty lines (\n) and consists of various optional fields.

The data field, which can be repeated to denote multiple lines in the message, is unsurprisingly used for the content of the event.

The event field allows to specify custom event types, which as we will show in the next section, can be used to fire different event handlers on the client.

The other two fields, id and retry, are used to configure the behaviour of the automatic reconnection mechanism. This is one of the most interesting features of Server-Sent Events. It ensures that when the connection is dropped or closed by the server, the client will automatically try to reconnect, without any user intervention.

The retry field is used to specify the minimum amount of time, in seconds, to wait before trying to reconnect. It can also be sent by a server, immediately before closing the client’s connection, to reduce its load when too many clients are connected.

The id field associates an identifier with the current event. When reconnecting the client will transmit to the server the last seen id, using the Last-Event-ID HTTP header. This allows the stream to be resumed from the correct point.

Finally, the server can stop the automatic reconnection mechanism altogether by returning an HTTP 204 No Content response. " -- [66]
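A rough Python sketch of how a client might parse the text/event-stream format described in the quote above. It follows my reading of the field rules (a blank line dispatches the event, lines starting with a colon are comments, repeated data lines are joined with newlines) and skips details such as BOM handling and incremental reads; `parse_sse` and `SAMPLE` are names made up for this example. Note that, as far as I know, the HTML Standard defines the retry value in milliseconds rather than seconds.

```python
import re

SAMPLE = "\n".join([
    "data: First message",
    "",
    "event: join",
    "data: Second message. It has two",
    "data: lines, a custom event type and an id.",
    "id: 5",
    "",
    ": comment. Can be used as keep-alive",
    "",
    "data: Third message. I do not have more data.",
    "data: Please retry later.",
    "retry: 10",
    "",
    "",
])


def parse_sse(stream: str):
    """Parse a complete event stream into a list of event dicts."""
    events = []
    data_lines, event_type = [], "message"
    last_id, retry = None, None  # these persist across events
    for line in re.split(r"\r\n|\n|\r", stream):
        if line == "":
            # Blank line: dispatch the pending event, if it has any data.
            if data_lines:
                events.append({
                    "event": event_type,
                    "data": "\n".join(data_lines),
                    "id": last_id,
                    "retry": retry,
                })
            data_lines, event_type = [], "message"
            continue
        if line.startswith(":"):
            continue  # comment, e.g. a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if name == "data":
            data_lines.append(value)
        elif name == "event":
            event_type = value
        elif name == "id":
            last_id = value
        elif name == "retry" and value.isdigit():
            retry = int(value)
    return events


for event in parse_sse(SAMPLE):
    print(event["event"], repr(event["data"]), event["id"], event["retry"])
```

In a browser none of this is needed, since EventSource does the parsing and reconnection itself; a parser like this is only relevant for non-browser clients.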

FIDL (Fuchsia Interface Definition Language)

"the language used to describe interprocess communication (IPC) protocols used by programs running on Fuchsia"

https://fuchsia.dev/fuchsia-src/concepts/fidl/overview

Discussions:

https://lobste.rs/s/mrxgig/fuchsia_idl_overview

Hay

https://www.oilshell.org/release/0.11.0/doc/hay.html

(part of the Oil shell project)

vs. TOML and Cue: " TOML is a data-only language. There are no functions / loops / conditionals. That is totally fine, but the minute you want to start “templating” it (like YAML/Go templates), I would say that is a smell. I mention in the doc that Hay is for the cases where you outgrow “plain old data” (which IMO happens to every system when it gets big enough). Cue is one of the more interesting config languages (i.e. it is NOT the “JSON with lambda/map/filter” design I dislike). As far as I understand, it does validation with a logic programming model. I think this could be useful for some things, but I do think “regular Python-like code” is more general – I feel like you will have to mix Cue with something else for most apps ? But I’d definitely like to hear from people who have success with Cue. " -- [67]

3D (Dependent Data Descriptions)

https://www.fstar-lang.org/papers/EverParse3D.pdf

Preserves

https://preserves.dev/

Protobuf ASCII

https://rachelbythebay.com/w/2023/10/05/config/

Opinions:

https://lobste.rs/s/f37hri/ascii_protocol_buffers_as_config_files

misc notes

"From the Java world comes Thrift, its successor Avro, and MessagePack. From Python we have pickle, which somehow has escaped the Python world to inflict harm upon others. From the C and C++ world we have Cap’n Proto, Flatbuffers, and perhaps the most popular, Google Protobuf (the heart of the widely adopted gRPC protocol). Now, these serialization libraries might have come from one language world, but they’d be useless without bindings in basically every other language… which they do generally boast, with the exception of pickle.

It should be noted that not all of these serialization libraries stop at serialization. Some bring protocol specification (RPC definition and calling convention) into scope. Notwithstanding that this can be useful, the fact that they are conflated within a single implementation is a tragedy. " -- https://www.circonus.com/2017/11/some-like-it-flat/

" ...for config languages...If we're limiting ourselves to just JSON, INI, XML and YAML as potential choices, I get why people cling onto one of these suboptimal choices and then fiercely defend it, but there are other options. There's libconfig, JSON5, Dhall, various interpreted languages... " -- https://news.ycombinator.com/item?id=37594549

"It's a noble goal to want a simple configuration format, toml is far from the simplest, line separated options format is simpler. The fact that it needs a parser indicates that it creates a bigger parsing problem than necessary." -- https://news.ycombinator.com/item?id=37597502

"...for configuration files...dhall" -- https://news.ycombinator.com/item?id=37598897