proj-oot-ootDataNotes2

http://www.w3.org/TR/json-ld/#data-model-overview

http://json-ld.org/learn.html

http://www.slideshare.net/gkellogg1/jsonld-json-for-the-social-web

http://www.slideshare.net/lanthaler/building-next-generation-web-ap-is-with-jsonld-and-hydra

http://www.slideshare.net/lanthaler/jsonld-for-restful-services

http://www.slideshare.net/gkellogg1/json-for-linked-data

https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.VAZAY9aNGZw

---

http://blog.codeclimate.com/blog/2014/06/05/choose-protocol-buffers/

required/optional/repeated are annotations of items in a data structure

numbered fields for versioning


i've been focusing on graphs/trees/ordered dicts as a generalization of Python's lists and dicts

still, in a way, lists and dicts are nicely complementary: an ordered dict is a faceted dict (an unordered-dict facet plus a list facet). so, if you only had unordered dicts, then what would lists buy you? iterators for comprehensive traversal of all items, and ADT destructuring bind (e.g. f(head:tail) = ...)

http://c2.com/cgi/wiki?RelationalLispWeenie

http://c2.com/cgi/wiki?MinimalTable

contrast declarative table access with http://c2.com/cgi/wiki?NavigationalDatabase :

" A term sometimes used to describe NetworkDatabase? (a.k.a. "Codasyl database) and HierarchicalDatabase? systems and techniques. This is because one often has to "navigate" from node to node (object-to-object or record-to-record) by "pointer hopping" or "reference hopping". The navigation is characterized by:

    Explicit directional instructions like "next", "previous", "nextChild", "toParent", etc.,
    Paths (like file paths)
    Common use of explicit (imperative) loops to traverse the structure(s) for aggregate calculations or finding/filtering.

This is in contrast to RelationalDatabase techniques which tend to use set notation, LogicProgramming-like or FunctionalProgramming-like techniques, and the concept of a "table" to logically describe what you want rather than how to navigate to get it. "

---

a relational db can be seen as a generalization of a triplestore to an n-tuple store. But somehow, operations like inner and outer join seem to me to be too low level; sometimes you just want to abstractly consider all available facts about an object, and in those cases the structure in terms of tables should be implicit/lower level. But i guess other times you want to compute, eg, a list of all things that a given person has bought.

---

"Pardon me if I'm just sniping on the word "object" here, but if you think of your data as objects then you will find the relational model restrictive.

In my experience, objects are an application concept, closely coupled to an implementation. If you can conceive of your data in implementation-independent terms, i.e. as entities and relationships, then you can put a RDBMS to effective use." -- https://news.ycombinator.com/item?id=8378176

---

http://c2.com/cgi/wiki?AreTablesGeneralPurposeStructures

---

http://www.haskell.org/haskellwiki/Foldable_and_Traversable

---

y'know, using the pandas library was too hard, but a simple step would be just to add names to arrays to make 'data frames', that you could index into using the names instead of the dimensions:

the usual way: 'i'll just remember that dimension 1 (rows) of A is Time and dimension 2 (columns) are things that happen at that time, in order, Heat, Speed' and then:

A[60:1801, 1] (select Speed at times from one minute to half an hour)

instead, the programmer could choose any of:

A[60:1801, 1]
A[60:1801, heat]
A[time=60:1801, heat]
A[time=60:1801, 1]

it's not clear whether it would be better to have the column names as implicit keywords in the source code, like i have here, or quoted:

A[60:1801, 1] A[60:1801, "heat"] A[time=60:1801, "heat"] A[time=60:1801, 1]

see also http://www.r-tutor.com/r-introduction/data-frame
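
a minimal sketch of what i mean, in Python; the LabeledArray wrapper and its names are made up for illustration, not an existing library:

# Hypothetical sketch: a thin wrapper that lets you index a 2-D numpy array
# by column name instead of (or as well as) by column number.
import numpy as np

class LabeledArray:
    def __init__(self, data, columns):
        self.data = np.asarray(data)
        self.columns = list(columns)      # e.g. ['heat', 'speed']

    def __getitem__(self, key):
        rows, col = key
        if isinstance(col, str):          # allow A[60:1801, 'heat']
            col = self.columns.index(col)
        return self.data[rows, col]

A = LabeledArray(np.random.rand(3600, 2), columns=['heat', 'speed'])
speed = A[60:1801, 'speed']               # same as A[60:1801, 1]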

i'd like to emphasize that dataframes are not, as i once thought, a system for naming dimensions in arbitrary n-dimensional arrays. Rather, they are a system for 2-D arrays, where each row is a data point and the columns are attributes. It's worth noting that this is a similar construct to a relational DB table; in both cases, a data point is a 'row', each row can have many attributes, and each attribute is a 'column'. One difference, if any, is that DataFrames can have row labels (i guess pandas calls the column/series of row labels an 'index'); these serve a similar function to a db table's primary key, except afaict the dataframe row label is not, by convention, an actual normal column (not sure about that though). Rows and columns are not quite symmetric, afaict.

also, a common conversion: between a representation where you have a series of attribute values and the row identifiers are implicit (implicitly a range of uniformly ascending integers starting from zero; a 'raster' representation), and, on the other hand, a representation where the row identifiers are explicit and strictly ascending but may skip some integers and may not start at zero. This is like matlab 'griddata'. Could the 'views' machinery help map one to the other, and also remember, for a series with implicit row identifiers starting at zero, the offset into the 'true' row identifiers (which are also integers but which start at some high number)? i think so.

(i call it convertLabeledTimesToRaster(times, values) in my internal utils lib)
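
something like this Python sketch (not the actual utils-lib code; it assumes integer time labels and uses NaN to fill the gaps):

import numpy as np

def convert_labeled_times_to_raster(times, values, fill=np.nan):
    # map (explicit, strictly ascending, possibly gappy integer time labels, values)
    # to a dense raster whose implicit row ids start at 0, plus the offset back
    # into the 'true' time labels
    times = np.asarray(times)
    offset = times[0]
    raster = np.full(times[-1] - offset + 1, fill, dtype=float)
    raster[times - offset] = values       # skipped labels stay as `fill`
    return raster, offset

# e.g. labels 100,101,103 -> raster of length 4 with a gap at index 2
raster, offset = convert_labeled_times_to_raster([100, 101, 103], [1.0, 2.0, 3.0])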


copied from [1]:

the idea from the end of the previous section is important enough that i wanted to emphasize it in its own section; instead of representing only the choice of a field, 'memory offsets' should represent a whole PATH, that is, a choice of a field and then optionally of a subfield, etc. We might interchangeably think of a PATH and of a QUERY. Eg .b.c.d (apply this offset/path/query to variable 'a', and you get a.b.c.d); eg [b][c] (apply this offset/path/query to 2-d array 'a', and you get a[b][c]); eg x y (apply the function f to these two arguments, and you get (f x y)).

note that with regard to function application, we are reminded of the 0-ary function issue, which we saw elsewhere also corresponds to the distinction between applying a complete slice index to a table and getting back a table with one element (for consistency with partial slices, which return eg columns of 2-d tables) vs. applying a complete slice index to a table and getting back a non-table value. And also of the issue of __set needing a 'path' through an object, as opposed to __get just needing a value.

And also of the connection between adding offsets to pointers, and getattr (doing a single field selection), and having offsets able to represent paths (nested field selection). In C, adding an offset to a pointer would work as a nested field selection iff the fields before the selected field were of known length; but in our case we want to abstract away from the byte length of fields. You might call paths 'queries', which also implies that more complicated, 'dynamic' methods of selection might obtain; basically, any computation that walks the graph and returns a pointer to a destination node. We see a similar distinction here as between effective address and referent in addressing modes. However 'a pointer to a destination node' is not really a 'path', so we may need to rethink calling these 'paths'; otoh b/c of metaprogramming like __get, __set, in some cases it will be important which nodes you pass thru on the way to another node, because the node you finally reached may be only a virtual node simulated by running __gets and __sets on the nodes that you went thru to get there.
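
to make the 'offsets as paths' idea concrete, a tiny Python sketch (names invented) where a path is just a list of field names:

def path_get(obj, path):
    # apply a path like ['b', 'c', 'd'] to obj, i.e. obj.b.c.d;
    # getattr may invoke properties/__getattr__, so metaprogramming can
    # intervene at every hop along the path
    for field in path:
        obj = getattr(obj, field)
    return obj

def path_set(obj, path, value):
    # setting needs the whole path: walk to the parent, then set the last field
    for field in path[:-1]:
        obj = getattr(obj, field)
    setattr(obj, path[-1], value)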

---

now if paths ARE important, and should be used as 'effective addresses' so that metaprogramming in the middle of a path has a chance to override, then how does that relate to the more typical idea that effective addresses are pointers, single locations? and if we have paths, we have category theory, so what else does this suggest?

i think it means that we should have equations between paths. That is, commuting diagrams; be able to assert that a.b.c = d.e.f, not just that the value (referent) of a.b.c = the value of d.e.f, but rather that the location (reference; effective address; path) a.b.c = (the path) d.e.f.

this also allows us to express things like OOP delegation. Which is another instance of having metaprogramming in the middle of a path.

this also ties in with the idea that we should be able to represent, in the language, the concept of one object having multiple ids (one of the key ideas of a Kantian 'object', eg the idea that there is something out there in the world that is somehow connected to various distinct sensory impressions, although of course the Kantian skepticism about what the object is/if there are more than one of them, which was one of his main points, isn't so useful here). Another example is that an item in your database might correspond to an item in someone else's database, but may have different primary keys. The idea that effective addresses = pointers = single locations is related to thinking of things as having a single canonical id, whereas the idea that an effective address is a path and that there may be equations between paths is more 'relative' and is related to recognition that some things may not have any canonical primary id.

---

in fact, the ideal data type may be distinct from the interface (signature); lists that can be resized, and tuples that cannot, are probably the same ideal type even though they support different operations (would this be true even if neither's signature was a subset of the other? probably; imagine lists which could only be lengthened, vs lists that could only be shortened: otoh in that example there is a common core)

---

note that the diagrams in http://research.swtch.com/godata provide a great example of what i mean by adding (topology? geometry?) to a graph: the boxes representing an array are graph nodes, but they are also contiguous and go from left to right, which is important. this is probably the key to representing arrays in oot. directions (left-right vs up-down) seem like named args / things that you query / arguments for a partially-applied fn :) / dimensions in multidim vectors

---

Taw has a blog post that bears directly on Oot:

http://t-a-w.blogspot.com/2010/07/arrays-are-not-integer-indexed-hashes.html

in Oot we treat both arrays and hashes as special cases of Graphs. Taw explains why this can't work. Namely:

" Consider this - what should be the return value of {0 => "zero", 1 => "one"}.select{

k,vv == "one"}?

If we treat it as a hash - let's say a mapping of numbers to their English names, there is only one correct answer, and everything else is completely wrong - {1=>"one"}.

On the other hand if we treat it as an array - just an ordered list of words - there is also only one correct answer, and everything else is completely wrong - {0=>"one"}.

These two are of course totally incompatible. And an identical problem affects a lot of essential methods. Deleting an element renumbers items for an array, but not for a hash. shift/unshift/drop/insert/slice make no sense for hashes, and methods like group_by and partition have two valid and conflicting interpretations. It is, pretty much, unfixable.

"

i think this can be resolved with Views; a slightly encapsulated version of "a monstrosity like PHP where ... half of array functions accept a boolean flag asking if you'd rather have it behave like an array or like a hash."

---

hmm i think Haskell 'lenses' are like Oot 'Queries' or 'Paths'

https://www.fpcomplete.com/school/to-infinity-and-beyond/pick-of-the-week/a-little-lens-starter-tutorial

note that they generalize 'set' to 'over'; 'over' applies a fn to the target, i.e. the provided fn takes the current value, and what it returns is what the new value is set to, eg over(f, x) = x.__set__(f(x.__get__()))
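
in the same spirit, a tiny Python sketch of 'over' built from a (get, set) pair (this 'Lens' is made up for illustration, not the Haskell lens library's actual API):

from collections import namedtuple

# a "lens" here is just a (get, set) pair, where set returns a modified copy
Lens = namedtuple('Lens', ['get', 'set'])

def over(lens, f, x):
    # generalized update: read the focus, apply f, write the result back
    return lens.set(x, f(lens.get(x)))

# e.g. a lens focusing on the 'age' key of a dict
age = Lens(get=lambda d: d['age'],
           set=lambda d, v: {**d, 'age': v})

over(age, lambda a: a + 1, {'name': 'bob', 'age': 41})   # {'name': 'bob', 'age': 42}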

these are cool too:

https://www.fpcomplete.com/school/to-infinity-and-beyond/pick-of-the-week/a-little-lens-starter-tutorial#the-lens-laws-

---

(first-class) 'queries' (=lenses, paths, slices) are closely related to data-binding

note: a 'path' through a graph is a graph way of looking at a reference to a part of a structure (e.g. parent.name); a 'slice' can be multiple 'paths' (e.g. array[2:4] references multiple elements in the array at once) with ordering (like an array), but slices do not usually have complex paths; one could generalize these by combining the two (eg parent.name[2:3] or parent[2:3].name); one could also generalize them by allowing reference to multiple elements at once in a dict-like (or more generally, graph-like) manner instead of just array-like... remember that we need to handle the results of eg SQL queries in this way... also note that we need dataframes (column names, and dimension names)..

---

"variants...are the dual of structs..." -- https://www.quora.com/Do-you-feel-that-golang-is-ugly/answer/Tikhon-Jelvis?srid=hMkC&share=1

--

what are Oot "patterns"? generalizing from regexps, i guess they are languages for boolean expressions with captures? eg 'http://([^ ]+)' is a regexp which is a boolean expression (it 'matches') but it also has a 'capture'.

in oot we want such patterns to also be 'firstclass', so they can be assigned to variables, dynamically built and manipulated, etc
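
(for comparison, Python's compiled regexes are already first-class in this weak sense: a pattern is a value that can be assigned, passed around, and built dynamically, and matching gives you both the boolean answer and the captures:)

import re

url = re.compile(r'http://([^ ]+)')       # the pattern is a first-class value
m = url.search('see http://example.com for details')
if m:                                     # boolean: did it match?
    print(m.group(1))                     # capture: 'example.com'

patterns = [url, re.compile(r'ftp://([^ ]+)')]   # can be stored, passed around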

--

in Javascript you can use variables as a way to get a namespace, eg:

var myModule = {
    controller: function() {},
    view: function() {}
}

(example from http://lhorie.github.io/mithril/getting-started.html )

another thing i thought i saw, that i did not, but that might be interesting, would be using function() {} as a way to get new namespaces. What was actually there was:

//define the view-model
todo.vm = {
    init: function() {
        //a running list of todos
        todo.vm.list = new todo.TodoList();

        //a slot to store the name of a new todo before it is created
        todo.vm.description = m.prop('');

        //adds a todo to the list, and clears the description field for user convenience
        todo.vm.add = function(description) {
            if (description()) {
                todo.vm.list.push(new todo.Todo({description: description()}));
                todo.vm.description("");
            }
        };
    }
};

but i didn't see the 'init: ' on the line 'init: function() {', so for a little while i thought this was just another way to make a new namespace.

i think doing it that way would be kinda confusing, but less confusing would be the opposite, where any namespace could be used as a function. Which i guess is just a convoluted way of saying that functions are first-class values? But i wonder if it could mean anything else than that.

--

http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

--

"In Python, x[(exp1, exp2, ..., expN)] is equivalent to x[exp1, exp2, ..., expN]; the latter is just syntactic sugar for the former."

" Remember that a slicing tuple can always be constructed as obj and used in the x[obj] notation. Slice objects can be used in the construction in place of the [start:stop:step] notation. For example, x[1:10:5,::-1] can also be implemented as obj = (slice(1,10,5), slice(None,None,-1)); x[obj] . This can be useful for constructing generic code that works on arrays of arbitrary dimension. "

" Basic slicing with more than one non-: entry in the slicing tuple, acts like repeated application of slicing using a single non-: entry, where the non-: entries are successively taken (with all other non-: entries replaced by :). Thus, x[ind1,...,ind2,:] acts like x[ind1][...,ind2,:] under basic slicing. "

" Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean.

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view). "

"

Warning

The definition of advanced indexing means that x[(1,2,3),] is fundamentally different than x[(1,2,3)]. The latter is equivalent to x[1,2,3] which will trigger basic selection while the former will trigger advanced indexing. Be sure to understand why this occurs.

Also recognize that x[[1,2,3]] will trigger advanced indexing, whereas x[[1,2,slice(None)]] will trigger basic slicing. "

-- http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#arrays-indexing
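
a small numpy session illustrating the quoted points (slice objects vs slicing syntax, and basic-slicing views vs advanced-indexing copies):

import numpy as np

x = np.arange(24).reshape(4, 6)

# a slicing tuple built programmatically, per the quote above
obj = (slice(1, 3), slice(None, None, -1))
assert (x[obj] == x[1:3, ::-1]).all()

basic = x[1:3, 0:2]          # basic slicing: a view onto x
basic[0, 0] = 99
assert x[1, 0] == 99         # ...so mutating it is visible in x

advanced = x[[1, 2], 0:2]    # a list index triggers advanced indexing: a copy
advanced[0, 0] = -1
assert x[1, 0] == 99         # ...so x is unchanged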

--

'array scalars' are a good idea:

http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html

--

this is great stuff:

http://www.kythe.io/docs/schema/#_builtin_types_2

see [2].

---

"RocksDB? organizes all data in sorted order and the common operations are Get(key), Put(key), Delete(key) and Scan(key)."

RocksDB? also allows writes to be batched into atomic transactions

---

are these two equivalent?

g = [A=[B=.B]; B=3]

g = [A=[B=3]]

in both, g.A.B == 3

however, in the second one, it seems like there is no node 'B' in the graph.

---

https://en.wikipedia.org/wiki/Hierarchical_Data_Format#HDF5

https://en.wikipedia.org/wiki/NetCDF

---

this is a joke but this is basically what i'm trying to seriously do: make everything out of graphs (in most cases, just tree-like associative arrays):

http://elbenshira.com/blog/the-universal-data-structure/ https://news.ycombinator.com/item?id=9804777

the hackernews discussion does point out some real problems with this sort of thing though:

"

jfe 10 hours ago

"Hashes are always O(1) reads, inserts and writes."

Maybe, once you've found a location to read, insert, or write to. The author neglects the runtime cost required for the hash algorithm itself, which may not be trivial; computing the hash of a string key is typically an O(n) operation.

Furthermore, unless a suitable table size is selected, integer keys (should one use a map like an array) will eventually hash to the same value, requiring even more time to iterate through all the remaining values that match that key until the desired value is found.

"I don’t know why you would ever use [a linked list] over an array (or a hash)..."

Here's why: because arrays take up space whether you use it or not. Linked lists don't suffer from this problem, but at the cost of 1-2 pointers per item. Has the author seriously never managed memory before? Please tell me this article is a joke.

reply "

fasteo 10 hours ago

Like a well trained Pavlov dog[1], reading the title brought "Lua table"[2] to my mind, the most flexible data structure I have worked with, by far.

[1] https://en.wikipedia.org/wiki/Classical_conditioning

[2] http://www.lua.org/pil/2.5.html

reply

masklinn 10 hours ago

> reading the title brought "Lua table"[2] to my mind, the most flexible data structure I have worked with, by far.

Which is not necessarily a good thing. PHP's array and JS's Object are essentially the same thing.

reply

pygy_ 8 hours ago

Lua tables accept arbitrary objects as keys, and make a difference between `foo[1]` and `foo["1"]`.

Plus all the metatables goodies: weak key and/or value references, prototype inheritance, ...

reply

jschwartzi 8 hours ago

They even mark the difference to the degree that foo[1] is stored in a special "array part" of the table, while foo["1"] ends up in the hash part. Predictably, the array part uses less memory.

reply

thaumasiotes 7 hours ago

> Lua tables accept arbitrary objects as keys, and make a difference between `foo[1]` and `foo["1"]`.

That sounds... completely normal? Python dictionaries will do that too. So will Java HashMaps.

reply

sirclueless 7 hours ago

That comment was responding to a comment about PHP and JS, both of which do not make such a distinction.

reply

seabee 7 hours ago

There is another subtlety that values associated with integer keys are stored as an array. dicts and HashMaps don't do this.

reply

adrusi 4 hours ago

And not even all integer keys! Lua will efficiently handle sparse arrays using tables.

reply

carapace 7 hours ago

"...We propose that many, and maybe even all, interesting organizations of information and behaviour might be built from a single primitive operation: n-way associative lookup. ..."

http://www.vpri.org/pdf/tr2011003_abmdb.pdf

reply

synthmeat 9 hours ago

While this may be a mockery, title probably alludes to Universal Design Pattern[1], which is not so easily dismissable idea.

[1] http://steve-yegge.blogspot.com/2008/10/universal-design-pat...

reply

 Totient 6 hours ago

Satire aside, I think a very short addition to the last line holds a lot of truth:

"A hash is simple. A hash is fast. A hash is all you need to start with".

I can think of plenty of good reasons to stop using a map/hash/associative array in code, but I can't think of very many good reasons not to start coding with associative arrays as your default data structure. If there's a performance/memory problem, fix it later. I've seen a lot more code suffer from premature optimization than I've seen suffer from using data structures that were a little too inefficient.

reply

brudgers 5 hours ago

When in doubt, use brute force. -- Ken Thompson

Using hashes as a first choice data structure is not necessarily a bad idea.[1] Until profiling a working implementation demonstrates otherwise, other data structures may be premature optimization.

[1] Clearly an improvement over the Lisper's association lists.

reply

 jcwilde 8 hours ago

My comment was meant tongue in cheek. Maps are _the_ universal data structure: they can be used to map any input to any output. The rest is just an implementation detail.

One might even call them "functions", but that ruins the joke.

reply

 malkia 10 hours ago

Basic had that. It was called arrays - yay for A$()

reply

jerf 10 hours ago

Having read over the entire thing, I have only one issue with it: The Universal Hash structure is only universal if the language implementing it permits you to create cyclic structures, so you can manipulate the "pointers" or relevant language concept to create the cyclic structures. There are a handful of languages that don't permit that, such as Erlang, and some languages like Haskell that permit you to use "tying the knot" to create cyclic structures [1], but ones that can sometimes be difficult to manipulate after the fact.

In those languages, you'll probably need to use a data structure well known for its simplicity, efficiency, and broad cross-language-platform compatibility: The Resource Description Framework's graph model. Graphs are also a well-known candidate for "universal data structure" [2]. Also, it's semantic, which is a clear advantage over the entirely not semantic UniversalHash, because semantic is better than not semantic when it comes to universal semantics of semanticness.

Semantic.

Otherwise, the article is pretty solid and all serious developers should definitely consider implementing it forthwith, as they will find it a very, very educational experience. I know I've seen people implement this model in Java before, and I can certainly vouch for the fact that it was an education for all involved.

I'm going to say "semantic" one more time, because it's been my observation that the more you say it, the smarter you sound, so: semantic. Oh, yeah, that's the stuff. Mmmmmmmmm.

[1]: https://wiki.haskell.org/Tying_the_Knot

[2]: Seriously, if you really do want to discuss the "universal data structure", perhaps because you want to analyze data structures very generically for some mathematical reason, graphs really are a good candidate. Not necessarily RDF graphs, which are both bizarrely complicated while lacking simple features; the three blessed collection types are a weird combination of features.

reply

jsprogrammer 9 hours ago

Do you really need to create "true" (i.e. language level?) cyclic structures? Shouldn't you be able to simulate cyclic structures at the cost of requiring more space (and probably time) to compute the simulation?

reply

jerf 9 hours ago

That would get into the stuff at the bottom, when you simulate other data structures within your data structure. A non-cyclic data structure can simulate cyclicness with IDs on nodes and things that store links... it's generally how you do it in Haskell, in fact, since while saying "it can't" do true graphs is perhaps a smidge overstrong it is certainly not practical to try to modify knot-tied structures. (I've seen the question "How would I do a graph?" repeatedly on /r/haskell, and "use IDs in a map" is generally what comes back.) But you're putting a layer on top of your store.

(By the by... you know you can trust me, because... semantic. Semantic.)

reply

 amelius 9 hours ago

Ehh, relational databases already proved that "sets" are the true universal data structures. Anything can be built upon them, including (hash)maps.

reply

dragonwriter 4 hours ago

A relation is equivalent to a map from its key to its non-key attributes. (A hash-map is just an implementation detail in how a map/relation is implemented.)

So, really, that's not a different universal data structure.

reply

https://www.google.com/search?client=ubuntu&hs=4wq&channel=fs&q=+universality+of+relational&oq=+universality+of+relational&gs_l=serp.3...4041.5991.0.6116.6.6.0.0.0.0.285.698.0j3j1.4.0.ckpsrh...0...1.1.64.serp..3.3.575.setZOKyzkyw

http://pages.cs.wisc.edu/~anhai/courses/784-sp09-anhai/ahoUllman.pdf

erikpukinskis 6 hours ago

Jonathan Blow made the point recently that academic programming language writers always make the same mistake of trying to take an idea and fully radicalize it.

When you go from "Objects are super easy and useful in this language" to "Everything Is An Object" you basically doom yourself to using objects to implement a bunch of stuff that doesn't really make sense as objects and could be implemented much easier as another data structure.

Big-brained academics love the challenge of "ooh, can I make everything an object?" because they are always free to decrease the scope of their research a little to compensate for the implementation taking a long time. And the more phenomena you can contort into agreement with your thesis, the more scholarpoints you get.

Blow advocates "data-driven programming" which, as a rule of thumb, I translate in my head as "don't move anything around you don't have to."

For example, rather than just copying a giant array of JSON objects over the wire when you only need some image URLs each with an array of timestamped strings, you write the code that serialized that data. And if you do that a few times, write the tooling you need to make that kind of thing easy.

The pitch is that it's not more work. And I'm kind of convinced. It just gets rid of so much pollution when you are debugging.

Your first cut of things is often a little weird: "do I need a generator generator here?" but typically you realize that a simpler solution works just as well in the refactor.

When you hack in a "wrong but easier to sketch out" solution into your code as the first try, it often just lives like that forever. Correct, confusing code often collapses into correct, simple code. Simple, functional-but-wrong code just seems less inclined to self improvement.

And I am continually surprised by how many problems, when simplified down as much as possible, are best described with the basics: functions, structs, arrays. You need fancy stuff sometimes for sure, but most of our human problems are trivial enough that these blunt tools suffice. I just often won't be able to see it until I've renamed all the variables three times.

What's interesting is I've been doing JavaScript programming this way, and Jonathan Blow is... shall I say... not a fan of JS. But I think the concepts translate pretty well! It's just instead of targeting raw memory, you target the DOM/js runtime which is actually a pretty powerful piece of metal if you have the patience to learn about the runtime and keep thinking about your data every step of the way.

reply

this article even reminded one guy of the everything-is-an-interface principle (so i think i'm onto something):

michaelochurch 4 hours ago

While this is satire, it brings to mind some bit of industry history. My first reaction was, "you want to define a type class called Associative because you're talking about an interface", and that got me thinking about OOP vs. Haskell's type classes (a superior approach) (...and then I realized that the OP was a satire.)

...


wagn might have a lot to teach me, too:

---

so, should we have anything having to do with versioning? and 'what was the value of this variable at that time'?

---

jacobolus 1 day ago

I’d love to have a shell-like environment where every command’s output was a stream of structured records with support for rich data types (ideally including stuff like images), not just lines of plain text.

The hard part here is not just figuring out all the core UI and protocols (though that would also take some work), but actually implementing all the hundreds or thousands of essential basic programs. Unfortunately existing technologies are so entrenched that this kind of thing isn’t going to happen without someone with very deep pockets funding it.

One neat thing is that a rich enough structured metaformat could be used directly for storing things like config files, logs, many types of structured documents, etc. directly to disk, and wrappers could be added to transcode existing legacy formats to/from the standard metaformat.

The difference could then be minimized between reading a typical file vs. reading the output of some tool, and likewise the difference could be minimized between reading config options from a file vs. passing in config options in the shell directly, etc. etc.

[Aside: also unfortunate is that there aren’t any solid document/protocol metaformats which would serve this purpose, as far as I can tell. The Clojure guys have generally the right idea with edn/fressian/transit, but the datatypes they’ve implemented are a bit too closely mapped directly to Clojure types (including types irrelevant in other contexts and missing meaningful distinctions from other contexts); in particular they haven’t really considered binary data types like images or big tables of numerical data. By contrast, data metaformats used in the scientific computing world like e.g. HDF don’t pay enough attention to standardizing complicated structures of non-numeric data. JSON and similar formats, even e.g. Apple plists, aren’t rich enough and so end up causing fragmented ad-hoc solutions to common problems. XML is terrible in almost every possible way. Etc.]

reply

jfim 1 day ago

Isn't that what Powershell does? It passes structured records between processes when using pipes[0] and they can be formatted and written to disk[1], as they're not just a stream of characters.

[0] http://www.tomsitpro.com/articles/powershell-piping-filterin...

[1] http://blogs.technet.com/b/heyscriptingguy/archive/2014/06/3...

reply

jacobolus 1 day ago

I’m not a Windows user, so I couldn’t tell you precisely. Those two articles don’t talk much about what kind of structure/format the records getting piped around have. Are they arbitrary hierarchies with rich (ideally extensible) data-types?

reply

Mikhail_Edoshin 1 day ago

I understand it's a holywar topic, but what's so bad about XML? I'd say it's a very nice serialization format for arbitrary data with a host of very powerful tools around it. I would love to see more software offering an XML dump option for their internal formats.

reply

jacobolus 1 day ago

XML is a very complex spec which is difficult to implement properly, a heavy format with high storage overhead, which is extremely expensive to parse or process, but also too verbose and finicky to be pleasant for human editing. It doesn’t have built-in standard support for most of the common data types you want in a structured document, so they are all stored as strings or sequences of tags, and then parsed out in an ad-hoc way by each tool built on top. Its namespace feature is ineffective and often a potential security vulnerability. Its separation between attributes and elements is handled arbitrarily by various XML-derived formats and tools, usually inconsistently within the same format. It has terrible support for big arrays of numeric or other binary data. Etc. Etc.

XML, like SGML, is plausibly reasonable when you have something like a word processor document or web page, but is wholly inappropriate for almost every other use.

Notice that despite its acute limitations, JSON ended up as the metaformat of choice for most Web APIs.

reply

Mikhail_Edoshin 1 day ago

I usually save web pages as XPS or PDF, so I can compare the lengths. XML 1.0 specs is 56 pages; by contrast, YAML 3.0 spec with similar formatting is 96 pages. And XML specs describes both the serialization format and simple grammar-based validation for the resulting high-level language (DTD); YAML only describes serialization.

XML is relatively verbose, but this is by design and is clearly stated as design goal #10: "Terseness in XML markup is of minimal importance."

The grammar for XML serialization itself clearly has 1-character lookahead structure, so the parser must be deterministic and thus work in linear time. The tools that process XML (e.g. XML Schema, XPath or XSLT) are based on tree automata and, in most cases, work in linear time as well. (Of course, one can end up with a slow XSLT, I meet them all the time, but one can end up with a slow regex too.)

XML Schema provides very good types and a way to define your own types. I admit this part is relatively complex, but I think it's inherent complexity. If you have a Schema-aware parser, you'll get all the usual types (numbers, dates) and even more so, plus a better (more powerful) formal description of the high-level language than DTD. (For example, DTD requires all structures to have different names, while Schema can define context-aware types.) And Relax-NG is even more powerful. This extra description power doesn't increase the runtime complexity though, it's still linear time.

I don't know what you mean by namespaces being ineffective or vulnerable; I'd say it's as good as it gets for an extensible framework of roll-your-own languages without central authority.

The structure of a particular XML-based format (i.e. tag names, use of attributes, etc.) is the responsibility of the author of this format. Yes, some are very sloppy and illogical, but a lot of code is, regardless of the language.

I agree about huge arrays; XML was never meant to handle them. But modern tools perform very well on moderate and even large amounts of data; a few hundred megabytes is not a problem at all.

XML is not just plausibly reasonable for word processing or web documents, it's the only format designed to handle such (mixed) content.

There is some shortage of tools, most state-of-art tools now are Java-based and this doesn't work for everyone. But the biggest problem with XML is the amount of FUD and prejudice that accompanies nearly every mention of it.

reply

EdiX 1 day ago

>what's so bad about XML?

That no programming language deals natively with XML's data structure. That's why xpath and xslt needed to be invented. This suggests that XML's data structure is not actually a good mapping for people's problems.

reply

Mikhail_Edoshin 1 day ago

I don't know about all the landscape, but in Python, at least, with `lxml`, you can configure the parser to yield native Python objects. I.e. you parse a XML file and get your own objects as a result. Here "your own" part is limited to your class and methods (no data, except what is in the element itself), but it's already rather convenient. (I can't say `lxml` is simple and Pythonic though; it's rather cumbersome to boot.)

reply

metasean 1 day ago

By chance, have you ever had a chance to use Quicksilver? If so, what are your thoughts on it? http://qsapp.com/about.php

reply

---

bad_user 1 day ago

> What I'd like to see is a CLI that (a) understands objects by default (i.e. PowerShell) and (b) is discoverable, for example by using mouse interactions when you're trying to learn.

I do not agree. PowerShell is not that great, partly because in spite of popular opinion, text is more composable than objects and doesn't tie you to a particular platform.

And in the context of the article, I would argue that "objects" are a dominant design, being often misapplied and misunderstood. Besides, what you need for communication between processes aren't objects, as objects as commonly understood have identity and objects with identity can't be serialized. What you need is a way to structure that information by means of basic data-structures, like dictionaries or lists. But lo-and-behold, that's what JSON is for.

reply

eru 1 day ago

Discoverability in the CLI can be much improved with proper completion and a hint system. The fish shell is doing a good job of exploring these ideas.

What do you mean by `objects' in this context? And how would they help?

reply

grkvlt 1 day ago

> ls

[entry.created for entry in $@ if entry.filename[0] == 'a'] | sort

You can do that type of thing with PowerShell quite easily. Something like this, maybe:

    Get-ChildItem | Where-Object Name.substring(0,1) -eq 'a' | Sort-Object CreationTime |

---

http://netflix.github.io/falcor/documentation/jsongraph.html

{ todosById: { "44": { name: "get milk from corner store", done: false, prerequisites: [{ $type: "ref", value: ["todosById", 54] }] }, "54": { name: "withdraw money from ATM", done: false, prerequisites: [] } }, todos: [ { $type: "ref", value: ["todosById", 44] }, { $type: "ref", value: ["todosById", 54] } ] };

New Primitive Value Types

In addition to JSON’s primitive types, JSON Graph introduces three new primitive types:

    Reference
    Atom
    Error

Each of these types is a JSON Graph object with a “$type” key that differentiates it from regular JSON objects, and describes the type of its “value” key. These three JSON Graph primitive types are always retrieved and replaced in their entirety just like a primitive JSON value. None of the JSON Graph values can be mutated using any of the available abstract JSON Graph operations.

Atom

An Atom is a JSON object with a “$type” key that has a value of “atom” and a ”value” key that contains a JSON value.

{ $type: "atom", value: ['en', 'fr'] }

JSON Graph allows metadata to be attached to values to control how they are handled by clients. For example, metadata can be attached to values to control how long values stay in a client cache. For more information see Sentinel Metadata.

One issue is that JavaScript value types do not preserve any metadata attached to them when they are serialized as JSON:

var number = 4; number['$expires'] = 5000;

console.log(JSON.stringify(number, null, 4))

This outputs the following to the console: 4

Atoms “box” value types inside of a JSON object, allowing metadata to be attached to them.

var number = { $type: "atom", value: 4, $expires: 5000 };

console.log(JSON.stringify(number, null, 4))

This outputs the following to the console: { "$type": "atom", "value": 4, "$expires": 5000 }

The value of an Atom is always treated like a value type, meaning it is retrieved and set in its entirety. An Atom cannot be mutated using any of the abstract JSON Graph operations. Instead you must replace Atoms entirely using the abstract set operation.

---

going along with my idea of trying to realize control flow constructs, and particularly concurrency constructs, as literal graph operations, there should be 'fork' (or 'split') and 'merge' operations on graphs. These will also come in handy for versioning.

---

---

so what is being sent as the content of all of these (typed) messages in these channels, channels which are like Unix pipes but which transmit structures, not just strings of text?

i guess it must be Oot Graphs

so each message is one Oot Graph. Each Oot Graph message might be something like a single RDF statement, but it might also be something bigger (or something like RDF but with n-tuples instead of 3-tuples, eg some extra roles for provenance, certainty, other modalities, etc). But graphs can contain 'foreign key references' to nodes in other graphs. These could refer to another graph that has already been sent down the channel. Or to another graph which will be sent (we need a way to indicate that a given foreign key doesn't have a match to a target yet, or that the sender has promised to send the target later but hasn't yet). Or to a graph which was shared before a fork, or in a shared db, or even on the internet somewhere (like Lion Kimbro's nLSD)

(note: in rereading the nLSD page just now, i had forgotten about Lion's ideas of (a) the 'START' flag (which we can do with graph labels), (b) binding function calls, which we can do with our standardized homoiconic AST representation of Oot Core (Lion just meant to bind to a local version of the given function, which we can also do with an Oot AST, by making a call to a function which is not defined in the AST but which is instead assumed to be defined in the environment in which/against which the AST is executed).)

todo: meditate more on Lion's, Murray's, and sigi's examples there

( note: i think there's a mistake in sigi's representation; he says that : ((lion and kitty) has child) IS sakura and he says that (sakura has parents) IS (lion and kitty)

(my caps); i think 'is' in those cases would refer to the reified statement (a reified edge), not to who the child or parents are; that's more like a secondary predicate attached to 'child' or 'parents', as a referent is attached to a pronoun; or possibly like an indirect object. )

note also sigi's formalism has the idea of substitution and lambda-calculus-like-abstraction:

---

there should be a way to state that two nodes are 'the same node' and/or that two different foreign keys refer to the same node and/or that a foreign key's referent is a given node. Eg we need node/foreign key 'equations'.

---

need to distinguish reified statements from reified edges; eg if we have "Bob is blue" (nodes Bob and Blue connected by an edge from Bob to Blue labeled 'color'), then we need a way to point to the edge from Bob to Blue, and we also need a distinct way to point to the statement "Bob is Blue"; eg -->(Bob is blue) vs Bob -->(is) blue

---

https://en.wikipedia.org/wiki/JSON#Object_references

http://json-ld.org/

---

Facebook Relay has something much like my Views in their graph-based data structure!:

https://github.com/facebook/relay/blob/master/docs/QuickStart-Tutorial.md

---

https://code.facebook.com/posts/1691455094417024

---

https://en.wikipedia.org/wiki/Data_type

---

maybe .---set should return a mutated COPY of the thing, and x.y = z should be syntactic sugar for a variable assignment to something involving .---set?

thinking aloud:

field mutation:

x.field1 = 3

field mutation as a non-mutating hof: x.---set(FIELD1, _) (equivalently, x.---set(-field1, _))
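
something like this Python sketch (names invented): a setter that returns a copy, so that the sugared assignment form could desugar to rebinding the variable:

import copy

def set_field(x, field, value):
    # non-mutating field update: return a copy of x with one field changed
    new = copy.copy(x)
    setattr(new, field, value)
    return new

# so `x.field1 = 3` could desugar to a plain variable rebinding:
# x = set_field(x, 'field1', 3)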

---

it's annoying that in Python, you get an error if you try to do

decimal.Decimal(1)*0.5

but is this just the cost of not having implicit type conversion?

at least decimal.Decimal(1)*2 works
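
concretely (Python 3):

from decimal import Decimal

Decimal(1) * 2                 # fine: int operands are accepted
# Decimal(1) * 0.5             # TypeError: unsupported operand type(s) for *
Decimal(1) * Decimal('0.5')    # the explicit-conversion workaround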

there should be something perhaps involving subset types or alternate representations that narrowly allows some implicit type conversions like this. Eg Decimal is a subset type of Float (or maybe, an alternate representation of the same type), so this implicit conversion should be allowed

---

a common type of query is a conjunction of assertions, where each assertion only involves atomic comparisons (==, !=, <=, <, >=, >) on properties

eg x.a > 1 AND x.b == 'hi' AND x.c != 'bye' AND x.d >= January 1st 1940
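
eg in Python such a query could be carried around first-class as a list of (attribute, operator, value) clauses (a sketch; the names here are invented):

import operator
from datetime import date

OPS = {'==': operator.eq, '!=': operator.ne, '<=': operator.le,
       '<': operator.lt, '>=': operator.ge, '>': operator.gt}

def matches(x, clauses):
    # clauses is a conjunction of atomic comparisons on properties of x
    return all(OPS[op](getattr(x, attr), value) for attr, op, value in clauses)

query = [('a', '>', 1), ('b', '==', 'hi'), ('c', '!=', 'bye'),
         ('d', '>=', date(1940, 1, 1))]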

---

it's key that:

data = array([
    [ 2.006667,  -1.50593973, -0.78853564,  1.10413865, -2.31074649,  0.00471899,  1.44112339, -0.38238545,  2.56819255, -2.0435719 ],
    [ 0.0345696,  2.31754328, -1.1448747,  -0.44883128,  1.76618548,  1.8109252,  -1.11762536, -2.60011543,  0.83882903, -2.9113708 ],
    [ 3.04685474, -1.03480186,  0.7027797,  -2.81070615, -1.6997237,   1.90667283, -2.84015824,  2.48292486,  0.81745892,  0.4002785 ],
    [ 1.92352707,  1.60024121,  0.42985859,  1.26019128,  0.47607111,  3.70762613,  1.27608097, -3.13248717,  3.72322113,  1.69854045],
    [-3.60006018,  0.27755204, -1.52015275, -0.24280737, -2.49276979, -1.07124972,  1.3779443,  -2.61218838,  0.93937661, -2.05147147],
    [-4.11276044, -2.047335,   -1.96500956, -2.64921263,  0.86339691, -1.88079199, -0.13924873, -2.72464465,  0.44741199,  0.61568983],
])

for datum in data: print datum

should do the right thing (that is, print

[ 2.006667   -1.50593973 -0.78853564  1.10413865 -2.31074649  0.00471899  1.44112339 -0.38238545  2.56819255 -2.0435719 ]
[ 0.0345696   2.31754328 -1.1448747  -0.44883128  1.76618548  1.8109252  -1.11762536 -2.60011543  0.83882903 -2.9113708 ]
...
)

(what i mean is that the parity for entering rows of data one at a time in the literal matrix constructor must match the parity by which the default iterator retrieves them; an example of 'wrong' would be if 'print datum' pulled out columns, eg mixed up/transposed the dimensional orientation in which i gave the data).

---

it's neat how in pandas if you slice a DataFrame by a single index, it makes a 1D Series but it remembers the name of the index:

DataFrame(index=['a', 'b', 'c'], columns = [0, 1,3], data = [[1,2,3], [4,5,6], [7,8,9]]).ix['a', :]

Out[46]:
0    1
1    2
3    3
Name: a

(the same is not true if you slice a single column btw, confusingly):

DataFrame(index=['a', 'b', 'c'], columns = [0, 1,3], data = [[1,2,3], [4,5,6], [7,8,9]]).ix[:, 0]

Out[49]:
a    1
b    4
c    7

---

a review of dataframe libraries by the pandas guy:

http://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly http://www.slideshare.net/wesm/dataframes-the-extended-cut

some probs with pandas, and a new lib 'badger' by the pandas guy and colleagues:

http://thiagomarzagao.com/2013/11/11/pandas-shortcomings/

http://ajkl.github.io/2014/11/23/Dataframes/ https://news.ycombinator.com/item?id=8794276

---

"features likes missing values, data frames, and subsetting." -- http://adv-r.had.co.nz/Introduction.html

---

in R,

the three properties of a vector, other than its contents, are its type (typeof()), its length (length()), and its attributes (attributes())

-- [3]

---

mutation on immutables -> like python datetime.datetime.replace -> but want more concise
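
eg (Python):

from datetime import datetime

t = datetime(2015, 1, 1, 12, 30)
t2 = t.replace(minute=0)    # non-mutating: t is unchanged, t2 is a modified copy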

---

smalltalk unifies:

in essence, fields of objects return their current value when sent a message with no content, and change their value when sent a message with content

some of those are not quite accurate but they are still food for thought

---

http://blog.circleci.com/why-we-use-om-and-why-were-excited-for-om-next/

i read this. it touches on surprisingly many Oot ideas, including Views (they say that in Om Next, the shape of the data as stored and as queried is different) and graphs (they say that in Om, the shape of the data is a tree, but in Om Next it is a graph), and 'capturing' mutations to immutable objects (they seem to be saying that in Om Next there are first-class mutations) and query languages

---