proj-plbook-plChDataLangs

Table of Contents for Programming Languages: a survey

Chapter: Data languages

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

DSV-style

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

CSV

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

RFC 822 Format

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

Cookie-Jar Format

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

Record-Jar Format

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

INI

http://www.catb.org/esr/writings/taoup/html/ch05s02.html

XML

Not Turing-complete.

Represents trees.

Considered 'heavyweight'.

XML schema languages:

Some distinguishing features:

...

JSON

Not Turing-complete.

Considered 'lightweight'.

Schemas:

Opinions:

JSON extensions

ASN.1

https://en.m.wikipedia.org/wiki/Abstract_Syntax_Notation_One

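Older format (first standardized in 1984). Considered 'heavyweight' and rarely chosen for new formats today, though it remains in use in telecommunications and in X.509 certificates.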

MessagePack

https://github.com/msgpack/msgpack/blob/master/spec.md

BSON

EDN

https://github.com/edn-format/edn

Not "a system for representing objects - there are no reference types, nor should a consumer have an expectation that two equivalent elements in some body of edn will yield distinct object identities when read" [5].

types:

extensibility through 'tags': a tag starts with '#', and indicates the semantics of the following element. E.g.:

  #myapp/Person {:first "Fred" :last "Mertz"}
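
Conceptually, a tag names a handler that is applied to the element following it, once that element (and any tags nested inside it) has been read. A minimal sketch of that dispatch in Python follows. It is not an edn parser: it assumes elements have already been read into Python values, and `Tagged`, `resolve`, and the handler table are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tagged:
    """A tag plus the element that follows it, e.g. #myapp/Person {...}."""
    tag: str
    element: Any


def resolve(element: Any, handlers: Dict[str, Callable[[Any], Any]]) -> Any:
    """Resolve tagged elements bottom-up: inner element first, then the handler."""
    if isinstance(element, Tagged):
        # Read (resolve) the tagged element first; it may contain tags itself...
        inner = resolve(element.element, handlers)
        # ...then pass the result to the handler registered for this tag.
        # An unregistered tag raises KeyError in this sketch.
        return handlers[element.tag](inner)
    if isinstance(element, dict):
        return {resolve(k, handlers): resolve(v, handlers) for k, v in element.items()}
    if isinstance(element, list):
        return [resolve(item, handlers) for item in element]
    return element


# Hypothetical handlers; a real reader would let the application register these.
handlers = {
    "myapp/Person": lambda m: ("Person", m["first"], m["last"]),
    "myapp/Team": lambda members: {"team": members},
}

data = Tagged("myapp/Team", [Tagged("myapp/Person", {"first": "Fred", "last": "Mertz"})])
result = resolve(data, handlers)
```

Each tag plus its element collapses to exactly one value, so the nested `myapp/Person` is resolved before `myapp/Team` sees it.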

"Upon encountering a tag, the reader will first read the next element (which may itself be or comprise other tagged elements), then pass the result to the corresponding handler for further interpretation, and the result of the handler will be the data value yielded by the tag + tagged element, i.e. reading a tag and tagged element yields one value."

built-in tagged elements:

comments: semicolon denotes a comment to-end-of-line

discard: #_ "indicates that the next element should be read and discarded...(note that the next element must still be a readable element)"

Links:

(see also Datalog)

Datalog

Implementation:

Fressian

https://github.com/Datomic/fressian

Protobuf (Protocol Buffers)

Supported by Google.

Version 3 drops required fields and adds maps.

Primitive types: bool, 32/64-bit integers, float, double, string. Composite types: none, apart from repeated elements and, in v3, maps. [6]

Resilient to malicious input [7].

New fields ignored by old binaries. [8]
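
One way to see why old binaries can ignore new fields: on the wire, every field is preceded by a key that carries its field number and a wire type, so a decoder can step over fields it does not recognize. The sketch below is a hand-written illustration handling only varint (wire type 0) and length-delimited (wire type 2) fields. It is not the protobuf library, and `decode_known` and the field-name tables are hypothetical.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_known(buf: bytes, known: dict) -> dict:
    """Decode fields listed in `known` ({field_number: name}); skip the rest."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:           # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:         # length-delimited (strings, bytes, ...)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:    # unknown field numbers are simply dropped
            out[known[field_number]] = value
    return out


# Field 1 = varint 150, field 2 = "hi", field 3 = varint 7 (added by a newer schema).
message = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi" + bytes([0x18, 0x07])
old_reader = decode_known(message, {1: "id", 2: "name"})
new_reader = decode_known(message, {1: "id", 2: "name", 3: "level"})
```

The old reader still has to walk every field to find where each one ends, which is also why random access without parsing is not available.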

Does not support random access reads without parsing [9].

Links:

MessagePack

Bencode

Thrift

Primitive types: bool, byte, 16/32/64-bit integers, double, string. Composite types: list<t1>, set<t1>, map<t1, t2>. [10]

Avro

Schema is expected to always be available with the data: sender and receiver must share exactly the same schema, though it can be sent along with the data. No code generation is required.

Schema in JSON. Example schema fragment:

{ "type": "record", "name": "BankDepositMsg", "fields":
  [
    {"name": "user_id", "type": "int"},
    {"name": "amount", "type": "double", "default": 0.00},
    {"name": "datestamp", "type": "long"}
  ]
}

Primitive types: null, boolean, int, long, float, double, bytes, string. Composite types: records, enums, arrays, maps, unions, fixed. [11]
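
Because the schema is plain JSON, it can be inspected with any JSON parser. The sketch below loads a copy of the schema fragment above with Python's standard `json` module and fills in declared defaults for missing fields. This is a simplification of what an Avro reader does when resolving schemas, not the Avro library itself. `apply_defaults` is a hypothetical helper, and the default is written here as a JSON number to match the `double` type.

```python
import json

SCHEMA_JSON = """
{ "type": "record", "name": "BankDepositMsg", "fields":
  [
    {"name": "user_id", "type": "int"},
    {"name": "amount", "type": "double", "default": 0.00},
    {"name": "datestamp", "type": "long"}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def apply_defaults(record: dict, schema: dict) -> dict:
    """Return a copy of record with missing fields filled from schema defaults."""
    filled = dict(record)
    for field in schema["fields"]:
        name = field["name"]
        if name in filled:
            continue
        if "default" not in field:
            # No value and no default: the record cannot be completed.
            raise ValueError(f"missing field with no default: {name}")
        filled[name] = field["default"]
    return filled


deposit = apply_defaults({"user_id": 42, "datestamp": 1400000000}, schema)
```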

Cap’n Proto

A sort-of successor to Protobuf: Kenton Varda, the author of Cap'n Proto, was the primary author of Protocol Buffers version 2, the rewrite of Protobuf that Google released as open source [12].

Zero-copy encoding.

Resilient to malicious input [13].

Supports random access reads without parsing [14].

New fields ignored by old binaries. Unknown fields retained. [15]

Links:

FlatBuffers

Zero-copy encoding.

New fields ignored by old binaries. Unknown fields cannot be copied. [16]

Supports random access reads without parsing [17].
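
The idea behind "random access reads without parsing" can be illustrated with a fixed-offset layout read via `struct.unpack_from`. The record layout below is invented for illustration and is not the FlatBuffers or Cap'n Proto wire format. It only shows one field being read straight out of a buffer without decoding anything else.

```python
import struct

# Hypothetical record layout (little-endian, no padding):
#   bytes 0..3   u32 id
#   bytes 4..11  f64 amount
#   bytes 12..19 i64 datestamp
RECORD = struct.Struct("<Idq")
AMOUNT_OFFSET = 4


def amount_of(buf, index: int) -> float:
    """Read only the `amount` field of record number `index`."""
    (amount,) = struct.unpack_from("<d", buf, index * RECORD.size + AMOUNT_OFFSET)
    return amount


# Build 1000 records back to back, then read single fields by offset.
buf = b"".join(RECORD.pack(i, i * 1.5, 1400000000 + i) for i in range(1000))
view = memoryview(buf)        # a view over the same bytes, no copy
third = amount_of(view, 3)
last = amount_of(view, 999)
```

Reading record 999 costs the same as reading record 3, since the position is computed rather than found by scanning earlier records.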

Transit

Core types: strings, booleans, integers (to 64 bits w/o truncation), floats, nil/null, bytearray, arrays, maps (with arbitrary scalar keys, not just strings).

Extension types: timestamps, UUIDs, URIs, arbitrary precision integers and decimals, symbols, keywords, characters, quoted values, sets, lists, arrays, hypermedia links, maps with composite keys. Also, custom extension types.

No reference types, nor identity. Not "a system for representing object graphs" [18].

An encoding on top of JSON or MessagePack.
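
To make "an encoding on top of JSON" concrete: Transit carries richer types inside ordinary JSON strings by prefixing them with `~` and a tag character. The following is a rough sketch of decoding a few scalar cases, assuming the `~:` (keyword), `~u` (UUID) and `~~` (escaped tilde) conventions from the Transit format spec. It ignores caching, maps, and every other tag, and `Keyword`, `decode_string` and `decode` are hypothetical names rather than any Transit library's API.

```python
import json
import uuid
from typing import NamedTuple


class Keyword(NamedTuple):
    name: str


def decode_string(s: str):
    """Interpret the handful of '~' prefixes this sketch knows about."""
    if s.startswith("~~"):
        return s[1:]                 # escaped: a plain string that begins with "~"
    if s.startswith("~:"):
        return Keyword(s[2:])
    if s.startswith("~u"):
        return uuid.UUID(s[2:])
    return s                         # other tags are left untouched in this sketch


def decode(value):
    """Walk a parsed JSON value, decoding tagged strings inside arrays."""
    if isinstance(value, str):
        return decode_string(value)
    if isinstance(value, list):
        return [decode(v) for v in value]
    return value


payload = '["~:status", "~u531a379e-31bb-4ce1-8690-158dceb64be6", "~~tilde", "plain", 3]'
decoded = decode(json.loads(payload))
```

The heavy lifting is done by the existing JSON parser; the Transit layer is a thin pass over its output.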

Links:

YAML

Not Turing-complete.

Popular as a configuration language.

YAML can encode cyclic data structures [19].
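
For contrast, JSON has no way to write a cycle. In YAML, a list that contains itself can be written with an anchor and an alias, e.g. `&a [*a]`. The sketch below builds the equivalent structure in Python and shows the standard `json` module refusing to serialize it. No YAML library is used here, so only the JSON half is actually exercised.

```python
import json

# A list whose only element is the list itself (what `&a [*a]` denotes in YAML).
cycle = []
cycle.append(cycle)

try:
    json.dumps(cycle)
    json_error = None
except ValueError as exc:
    # json.dumps checks for circular references by default and raises.
    json_error = str(exc)
```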

Indentation is significant.

"(YAML) goes to great lengths to provide human-friendly features, trading off computer-friendliness to an fairly extreme extent. Eliminating 'plain' scalars (unquoted strings-as-values), folded multiline literals, tags, anchors/aliases, and possibly directives, as a sort of reduced yaml would make the language a lot less silly for the kinds of things a lot of people end up using it for. " [20]

"is still the best 'JSON with better layout' out there, which is what we want when we want readable documents and maintainable code. EDN, Transit and TOML all fail in that somewhere, IMO, and of course popularity is actually a hugely important feature for any data interchange format." -- [21]

Opinions/gotchas:

Retrospectives:

StrictYAML

Simpler than YAML:

https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst

Removes:

Opinions:

TOML

https://github.com/toml-lang/toml
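
What typed values add over INI is easiest to see from the INI side: Python's standard `configparser` returns every value as a string and leaves conversion to the caller. This is a small sketch with made-up section and key names, and TOML itself is not parsed here.

```python
import configparser

INI_TEXT = """
[server]
port = 8080
debug = true
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

raw_port = parser["server"]["port"]             # a str, not an int
port = parser["server"].getint("port")          # the caller picks the type
debug = parser["server"].getboolean("debug")    # likewise for booleans
```

In TOML the same two lines would yield an integer and a boolean directly, and a nested table such as `[server.tls]` expresses hierarchy that INI sections cannot.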

"essentially an extended version of INI which allows the expression of both hierarchical and typed data." [26]

Discussion:

S-expressions (sexps)

Not Turing-complete (note: Lisp is written in S-expressions, but it is a language built on top of them; the notation itself is only data).

Represents trees.

"Canonical" S-expressions (csexps)

example:

(4:this22:Canonical S-expression3:has1:55:atoms)
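
In the canonical form each atom is written as its byte length in ASCII decimal, a colon, then the bytes, and lists are wrapped in parentheses with no whitespace between elements. A minimal encoder and decoder sketch in Python follows, representing lists as Python lists and atoms as `bytes`. The optional `[...]` display hints are not handled, and the function names are illustrative only.

```python
def encode(sexp) -> bytes:
    """Encode nested lists of atoms (str or bytes) as a canonical S-expression."""
    if isinstance(sexp, (list, tuple)):
        return b"(" + b"".join(encode(item) for item in sexp) + b")"
    atom = sexp.encode("utf-8") if isinstance(sexp, str) else bytes(sexp)
    return str(len(atom)).encode("ascii") + b":" + atom


def _decode_at(data: bytes, pos: int):
    """Decode one element starting at pos; return (value, new_pos)."""
    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = _decode_at(data, pos)
            items.append(item)
        return items, pos + 1            # step past the closing ")"
    colon = data.index(b":", pos)        # raises ValueError if no length prefix
    length = int(data[pos:colon])
    start = colon + 1
    end = start + length
    if end > len(data):
        raise ValueError("atom runs past end of input")
    return data[start:end], end


def decode(data: bytes):
    """Parse exactly one canonical S-expression from data."""
    value, pos = _decode_at(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after S-expression")
    return value


encoded = encode(["this", "Canonical S-expression", "has", "5", "atoms"])
decoded = decode(encoded)
```

Because atoms are length-prefixed, they can hold arbitrary bytes (including parentheses and colons) without any escaping.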

"a binary encoding form of a subset of general S-expression (or sexp)...The particular subset of general S-expressions applicable here is composed of atoms, which are byte strings, and parentheses used to delimit lists or sub-lists. These S-expressions are fully recursive. ... While S-expressions are typically encoded as text, with spaces delimiting atoms and quotation marks used to surround atoms that contain spaces, when using the canonical encoding each atom is encoded as a length-prefixed byte string. No whitespace separating adjacent elements in a list is permitted. The length of an atom is expressed as an ASCII decimal number followed by a ":". ... A csexp includes a non-S-expression construct for indicating the encoding of a string, when that encoding is not obvious. Any atom in csexp can be prefixed by a single atom in square brackets – such as "[4:JPEG]" or "[24:text/plain;charset=utf-8]". " [27]

pros:

Links:

Comparisons

" edn is the best choice for human-readable data. It is however, less efficient to transmit and depends on writing a high-performance parser - this is a high bar in some language environments. edn is most attractive right now to Clojure users b/c of its proximity to Clojure itself. While it has many advantages as an extensible readable literal data format, it's an uphill battle to sell that against other data formats that already have greater mindshare and tooling in other language communities.

fressian is the highest performance option - it takes full advantage of a number of compression tricks and has support for arbitrary user-extensible caching. Again, it requires a fair amount of effort to write a fressian library though so it's probably best for JVM-oriented endpoints right now. By seeking greatest performance, fressian also makes tradeoffs that narrow its range of use and interest group.

transit is a pragmatic midpoint between these two. It focuses, like fressian, on program-to-program data transmission however, the data can be made readable (like edn) via the json-verbose mode. Like fressian, transit contains caching capabilities but they are more limited and not user-extensible. transit is designed primarily to have the most high-quality implementations per lowest effort - effectively shooting for greater reach than either edn or fressian by lower the bar to implementation. The bar is lowered by reusing high-performance parser for either JSON or messagepack which exist in a large number of languages. Of particular importance is leveraging the very high performance JSON parsers available in JavaScript runtimes, making transit viable as a browser-side endpoint for a fraction of the effort required to write a high performance edn or fressian endpoint. As transit explicitly seeks reach and portability, it is naturally the format with the broadest potential usage. "

-- https://groups.google.com/forum/#!topic/clojure/9ESqyT6G5nU

"Transit sounds like an evolution of EDN and Fressian: make the bottom layer pluggable to support human-readable/browser-friendly JSON or use the well-established msgpack for compactness. Caching is still there, but it can only be used for keywords/strings/symbols/etc. instead of arbitrary values like Fressian -- probably a good trade-off for simplicity." [33]

BayesDB

links:

Alan's data modeling language

See https://alan-platform.com/pages/tuts/getting-started.html

Dhall

"A configuration language guaranteed to terminate"

https://github.com/dhall-lang/dhall-lang

Dhall links

Data language concepts, patterns, and practices

Data language links