proj-oot-ootNotes17

---

one thing to do is to look at frequent words and patterns in existing code corpora in order to see what is common, so that we can consider treating these common things as fundamental, and also, more pedestrianly, so that we can optimize for making them easy to read and write in Oot:

text mining on Java source code. Most frequent lexemes: " The most commonly occurring word is the + operator (305,685) followed by the scoping block (295,726) and the = operator (124,813). If we exclude operators and scoping blocks from our analysis, the most frequent words are public (124,399), if (119,787), and int (108,709). The most common identifier (the programming language equivalent of a lexical word in natural language as discussed in Section 3.2), is String. It is the ninth most frequently occurring word overall with 71,504 occurrences. This pseudo-primitive type in Java is a special case of a non-primitive that has nearly achieved primitive status in the language and may well do so in either a future version of Java or a derivative language it spawns. The next three most frequent lexical words are length (19,312), Object (18,506), and IOException (11,322). " -- http://flosshub.org/sites/flosshub.org/files/21st-delorey.pdf

" Top Idioms

Figure 6 shows the top idioms mined in the Library data set, ranked by the number of files in the test sets where each idiom has appeared in. The reader will observe their immediate usefulness. Some idioms capture how to retrieve or instantiate an object. For example, in Figure 6, the idiom 6a captures the instantiation of a message channel in RabbitMQ, 6q retrieves a handle for the Hadoop file system, 6e builds a SearchSourceBuilder in Elasticsearch and 6l retrieves a URL using JSoup. Other idioms capture important transactional properties of code: idiom 6h demonstrates proper use of the memory-hungry RevWalk object in JGit and 6i is a transaction idiom in Neo4J. Other idioms capture common error handling, such as 6d for Neo4J and 6p for a Hibernate transaction. Finally, some idioms capture common operations, such as closing a connection in Netty (6m), traversing through the database nodes (6n), visiting all AST nodes in a JavaScript file in Rhino (6k) and computing the distance between two locations (6g) in Android. The reader may observe that these idioms provide a meaningful set of coding patterns for each library, capturing semantically consistent actions that a developer is likely to need when using these libraries. In Figure 7 we present a small set of general Java idioms mined across all data sets by Haggis. These idioms represent frequently used patterns that could be included by default in tools such as Eclipse's SnipMatch [43] and IntelliJ's live templates [23]. These include idioms for defining constants (Figure 7c), creating loggers (Figure 7b) and iterating through an iterable (Figure 7a).

Figure 6: Top cross-project idioms for Library projects (Figure 4). Here we include idioms that appear in the test set files. We rank them by the number of distinct files they appear in and restrict into presenting idioms that contain at least one library-specific (i.e. API-specific) identifier. The special notation $(TypeName) denotes the presence of a variable whose name is undefined. $BODY$ denotes a user-defined code block of one or more statements, $name a freely defined (variable) name, $methodInvoc a single method invocation statement and $ifstatement a single if statement. All the idioms have been automatically identified by Haggis.

channel=connection.createChannel();

Elements $name=$(Element).select($StringLit);

Transaction tx=ConnectionFactory.getDatabase().beginTx();

catch (Exception e){ $(Transaction).failure(); }

SearchSourceBuilder builder=getQueryTranslator().build($(ContentIndexQuery));

LocationManager $name = (LocationManager)getSystemService(Context.LOCATION_SERVICE);

Location.distanceBetween($(Location).getLatitude(), $(Location).getLongitude(), $...);

try { $BODY$ } finally { $(RevWalk).release(); }

try { Node $name=$methodInvoc(); $BODY$ } finally { $(Transaction).finish(); }

ConnectionFactory factory = new ConnectionFactory(); $methodInvoc(); Connection connection = factory.newConnection();

while ($(ModelNode) != null){ if ($(ModelNode) == limit) break; $ifstatement $(ModelNode)=$(ModelNode).getParentModelNode(); }

Document doc=Jsoup.connect(URL).userAgent("Mozilla").header("Accept","text/html").get();

if ($(Connection) != null){ try { $(Connection).close(); } catch (Exception ignore){} }

Traverser traverser = $(Node).traverse(); for (Node $name : traverser){ $BODY$ }

Toast.makeText(this, $stringLit, Toast.LENGTH_SHORT).show()

try { Session session = HibernateUtil.currentSession(); $BODY$ } catch (HibernateException e){ throw new DaoException(e); }

FileSystem $name = FileSystem.get($(Path).toUri(), conf);

(token=$(XContentParser).nextToken()) != XContentParser.Token.END_OBJECT

Figure 7: Sample language-specific idioms. $StringLit denotes a user-defined string literal, $name a (variable) name, $methodInvoc a method invocation statement, $ifstatement an if statement and $BODY$ a code block.

(a) Iterate through the elements of an Iterator: for (Iterator iter=$methodInvoc; iter.hasNext(); ) {$BODY$}

(b) Creating a logger for a class:

private final static Log $name = LogFactory.getLog($type.class);

(c) Defining a constant String:

public static final String $name = $StringLit;

(d) Looping through lines from a BufferedReader:

while (($(String) = $(BufferedReader).readLine()) != null) {$BODY$}

-- http://homepages.inf.ed.ac.uk/csutton/publications/idioms.pdf

One interesting observation is that 50% of Java methods are 3 lines or less. Manually inspecting these methods we find accessors (setters and getters) or empty methods (e.g. constructors).

-- http://homepages.inf.ed.ac.uk/csutton/publications/msr2013.pdf

Table 2: The attribute catalogue

Name - Formal definition
Returns void - The return descriptor is V.
No parameters - The list of parameter descriptors is empty.
Field reader - GETFIELD or GETSTATIC instruction.
Field writer - PUTFIELD or PUTSTATIC instruction.
Contains loop - Jump instructions that allow for instructions to be executed more than once in the same method invocation.
Creates object - NEW instruction.
Throws exception - ATHROW instruction.
Type manipulator - INSTANCEOF or CHECKCAST instruction.
Local assignment - One of the STORE instructions (for instance, ISTORE).
Same name call - Calls a method of the same name.

The name get is interesting because it is by far the most common one; nearly a third of all Java methods in the corpus are get-methods.

Lexicon Entries.

ACCEPT. Methods named accept very seldom read state. Furthermore, they rarely throw exceptions, call methods of the same name, create objects, manipulate state, use local variables, have no parameters, perform type-checking or contain loops. The name accept has a precise use. A similar name is visit . Generalisations of accept are handle and initialize . Somewhat related names are set , end , is and insert .

ACTION. Methods named action never call methods of the same name. Further- more, they very often read state. Finally, they often return void, and rarely throw exceptions, have no parameters or contain loops. The name action has a precise use. Similar names are remove and add.

ADD. Among the most common method names. Methods named add often read state. Similar names are remove and action .

CHECK. Methods named check very often throw exceptions. Furthermore, they often create objects and contain loops, and rarely call methods of the same name. Unfortunately, check is an imprecise name for a method.

CLEAR. Methods named clear very often have no parameters. Furthermore, they often return void, call methods of the same name and manipulate state, and rarely create objects, use local variables or perform type-checking. A generalisation of clear is reset . A somewhat related name is close .

CLOSE. Methods named close often return void, call methods of the same name, manipulate state, read state and have no parameters, and rarely create objects or perform type-checking. A generalisation of close is validate . A somewhat related name is clear .

CREATE. Among the most common method names. Methods named create very often create objects. Furthermore, they rarely call methods of the same name, read state or contain loops.

DO. Methods named do often throw exceptions and perform type-checking, and rarely call methods of the same name. Unfortunately, do is an imprecise name for a method.

DUMP. Methods named dump never throw exceptions. Furthermore, they very often create objects and use local variables, and very seldom read state. Finally, they often call methods of the same name and contain loops, and rarely manipulate state. The name dump has a precise use.

END. Methods named end often return void, and rarely create objects, use local variables, read state or contain loops. Generalisations of end are handle and initialize . A specialisation of end is insert . Somewhat related names are accept , set , visit and write .

EQUALS. Methods named equals never return void, throw exceptions, create objects, manipulate state or have no parameters. Furthermore, they very often call methods of the same name and perform type-checking. Finally, they often use local variables and read state. The name equals has a precise use.

FIND. Methods named find very often use local variables and contain loops. Furthermore, they often perform type-checking, and rarely return void.

GENERATE. Methods named generate often create objects, use local variables and contain loops, and rarely call methods of the same name. Unfortunately, generate is an imprecise name for a method.

GET. The most common method name. Methods named get often read state and have no parameters, and rarely return void, call methods of the same name, manipulate state, use local variables or contain loops. A similar name is has . Specialisations of get are is and size . A somewhat related name is hash .

HANDLE. Methods named handle often read state, and rarely call methods of the same name. A similar name is initialize . Specialisations of handle are accept , set , visit , end and insert .

HAS. Methods named has often have no parameters, and rarely return void, throw exceptions, create objects, manipulate state, use local variables or perform type-checking. The name has has a precise use. A similar name is get . Specialisations of has are is and size . A somewhat related name is hash .

HASH. Methods named hash always have no parameters, and never return void, throw exceptions, create objects or perform type-checking. Furthermore, they very often call methods of the same name. Finally, they often read state, and rarely manipulate state or use local variables. The name hash has a precise use. Somewhat related names are has , is , get and size .

INIT. Methods named init very often manipulate state. Furthermore, they often return void, create objects and have no parameters, and rarely call methods of the same name.

INITIALIZE. Methods named initialize often return void and manipulate state, and rarely call methods of the same name or read state. A similar name is handle . Specialisations of initialize are accept , set , visit , end and insert .

INSERT. Methods named insert often throw exceptions, and rarely create objects, read state, have no parameters or contain loops. Generalisations of insert are handle , end and initialize . Somewhat related names are accept , set , visit and write .

IS. The third most common method name. Methods named is often have no parameters, and rarely return void, throw exceptions, call methods of the same name, create objects, manipulate state, use local variables, perform type- checking or contain loops. The name is has a precise use. Generalisations of is are has and get . Somewhat related names are accept , visit , hash and size .

LOAD. Methods named load very often use local variables. Furthermore, they often throw exceptions, create objects, manipulate state, perform type-checking and contain loops. Unfortunately, load is an imprecise name for a method.

MAKE. Methods named make very often create objects. Furthermore, they rarely return void, throw exceptions, call methods of the same name or contain loops.

NEW. Methods named new never contain loops. Furthermore, they very seldom use local variables. Finally, they often call methods of the same name and create objects, and rarely return void, manipulate state or read state.

NEXT. Methods named next very often manipulate state and read state. Furthermore, they often throw exceptions and have no parameters, and rarely return void.

PARSE. Among the most common method names. Methods named parse very often call methods of the same name, read state and perform type-checking. Furthermore, they rarely use local variables. The name parse has a precise use.

PRINT. Methods named print often call methods of the same name and contain loops, and rarely throw exceptions or manipulate state.

PROCESS. Methods named process very often use local variables and contain loops. Furthermore, they often throw exceptions, create objects, read state and perform type-checking, and rarely call methods of the same name. Unfortunately, process is an imprecise name for a method.

READ. Methods named read often throw exceptions, call methods of the same name, create objects, manipulate state, use local variables and contain loops. Unfortunately, read is an imprecise name for a method.

REMOVE. Among the most common method names. Methods named remove often throw exceptions. Similar names are add and action .

RESET. Methods named reset very often manipulate state. Furthermore, they often return void and have no parameters, and rarely create objects, use local variables or perform type-checking. A specialisation of reset is clear .

RUN. Among the most common method names. Methods named run very often read state. Furthermore, they often have no parameters, and rarely call methods of the same name.

SET. The second most common method name. Methods named set very often manipulate state, and very seldom use local variables or read state. Furthermore, they often return void, and rarely call methods of the same name, create objects, have no parameters, perform type-checking or contain loops. The name set has a precise use. Generalisations of set are handle and initialize . Somewhat related names are accept , visit , end and insert .

SIZE. Methods named size always have no parameters, and never return void, create objects, manipulate state, perform type-checking or contain loops. Furthermore, they very seldom use local variables. Finally, they rarely read state. The name size has a precise use. Generalisations of size are has and get . Somewhat related names are is and hash .

START. Methods named start often return void, manipulate state and read state.

TO. Among the most common method names. Methods named to very often call methods of the same name and create objects. Furthermore, they often have no parameters, and rarely return void, throw exceptions, manipulate state or perform type-checking.

UPDATE. Methods named update often return void and read state.

VALIDATE. Methods named validate very often throw exceptions. Furthermore, they often create objects and have no parameters, and rarely manipulate state. A specialisation of validate is close .

VISIT. Methods named visit rarely throw exceptions, use local variables, read state or have no parameters. A similar name is accept . Generalisations of visit are handle and initialize . Somewhat related names are set , end , is and insert .

WRITE. Among the most common method names. Methods named write often return void and call methods of the same name, and rarely have no parameters. Somewhat related names are end and insert .

-- The Programmer’s Lexicon, Volume I: The Verbs

---

(at least) 3 ways to loop: while, jump, hof search function (but is that the same as while?). also collection-oriented looping is not universal for control, but does most of it. also for long-lasting things where you want to just 'loop forever and respond to events until i decide to terminate', instead of having that loop in your program, you could register with a manager that calls you upon each iteration (but the manager still has to loop) (is this really that different from a while loop?).
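
A minimal sketch of that last alternative (plain Python; the names run_manager and handler are made up for illustration), showing that registering a callback with a manager moves the loop into the manager rather than eliminating it:

def run_manager(handlers, events):
    # the 'loop forever and respond to events' loop lives in the manager
    for ev in events:
        for h in handlers:
            if h(ev) == "terminate":   # a handler can still decide to stop
                return

def handler(ev):
    print("handling", ev)
    return "terminate" if ev == "quit" else None

run_manager([handler], ["a", "b", "quit", "never reached"])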

---

examples of things that cannot directly be 'inlined' in some languages:

---

fleshing out the idea of a general, fundamental Search operator (generalization of a fixpoint operator) a little:

the search operator takes parameters in two stages, that is, it is a higher-order function that takes two arguments, each of which is a 'package' of functions and parameters (or it could just take a bunch of arguments, with no 'packaging'). First, it takes a group of parameters that specify a search strategy. This includes functions that say how to initialize the search's internal state, how to choose the next search position given some internal state (perhaps the previous search position and its score), and when to terminate the search. For example, by giving different functions for these inputs, you can create a depth-first search, a breadth-first search, an A* search, a fixpoint operator (terminate upon idempotency of "next search position"), or a search that quits when it hits a plateau (a near-fixpoint) even if the objective function is still moving some tiny amount.

Second, it takes another group of parameters to choose the objective function, and to set the items to be searched through, and to choose the initial location of the search.

note that this is equivalent to an OOP system where there is an abstract base class (the Search operator), and concrete subclasses (breadth-first search, depth-first search, A*, fixpoint search, a search that quits when it hits a plateau). The class defines/satisfies an interface that has one method (do_search or the like).
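
A minimal sketch of this in Python (all names hypothetical, Python standing in for whatever Oot would actually look like): stage one packages the strategy, stage two applies it to a concrete problem, and a fixpoint operator falls out as one choice of strategy:

import math

def make_search(init_state, next_position, should_stop):
    # stage 1: package a strategy (init / next / termination) into a searcher
    def search(objective, start):
        # stage 2: run that strategy against a concrete objective and start point
        state = init_state(start)
        pos = start
        while not should_stop(state, pos, objective(pos)):
            state, pos = next_position(state, pos, objective(pos))
        return pos
    return search

# a fixpoint operator: keep applying the objective, stop at (near-)idempotency
fixpoint = make_search(
    init_state=lambda start: None,
    next_position=lambda state, pos, val: (pos, val),
    should_stop=lambda state, pos, val: abs(val - pos) < 1e-12,
)

print(fixpoint(math.cos, 1.0))   # ~0.739085, the fixed point of cos

A depth-first or A* search would use the state argument to carry a stack or priority queue; a plateau search would just loosen the termination test.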

---

the example of a general fundamental Search operator shows us that, when an OOP base class's purpose is just to be Called through one primary method call, it corresponds to a higher-order function.
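
A small sketch of that correspondence (Python, names made up), with the class-based and function-based versions interchangeable:

from abc import ABC, abstractmethod

class Search(ABC):
    # the base class exists only to be called through do_search
    @abstractmethod
    def do_search(self, objective, start): ...

class FixpointSearch(Search):
    def do_search(self, objective, start):
        pos = start
        while objective(pos) != pos:
            pos = objective(pos)
        return pos

# the same thing as a plain function value, passed around directly
def fixpoint_search(objective, start):
    pos = start
    while objective(pos) != pos:
        pos = objective(pos)
    return pos

step = lambda x: min(x + 1, 10)   # fixpoint is 10
assert FixpointSearch().do_search(step, 0) == fixpoint_search(step, 0) == 10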

---

was talking about goal-oriented programming with my friend DR. I explained my idea that you could specify a goal in terms of preconditions and postconditions (eg defining a sort) and the compiler could find a subroutine to satisfy them. And then you could add time and space complexity requirements, eg "cannot require more than O(n^2) space". And then you could add time and space complexity hints, eg "i am going to read this data structure a lot but rarely write to it". DR pointed out that this means giving priorities. I said the trouble with priorities is that in a formal mathematical sense if you say "maximize x at the expense of y" you might end up with some solution that is EXTREMELY costly in y for a tiny gain in x, which is usually not what humans mean when they say to another human "prioritize a over b"; DR noted this doesn't mean priority is not useful here, it just means that we are looking to explore this fuzzier definition of priority. Also i mentioned Alan Kay's phrase policy-oriented programming (or something like that). DR then pointed out that so far in this convo we have three concepts to think about for goal-oriented programming:

---

another interesting goal for a simple language is to think of what would be desired for a post-apocalyptic scenario. I think this is unlikely (even conditional upon an apocalypse, which is already unlikely), but imagine if a few individuals have working computers but only a few of them, and so much has been lost that no one has the complete toolchain, eg. whatever is required to cross-compile gcc onto a new architecture; or (even less likely) imagine if gcc is available, but not the gcc source code; and (even less likely) imagine that there is no comprehensive C specification or even documentation around; in such a situation, programming language implementations would have to be re-implemented by ordinary programmers (not compiler specialists) based on what they remember about the language (they can maybe refer to some code samples from a few personal projects they happened to have on their personal machine at the time of the apocalypse). Assume that communication is initially spotty enough that there are multiple re-implementors who do not know of each other until much later. Most likely these different re-implementors would misremember different things about the language definition and we'd get a family of mutually incompatible, C-like new languages.

Contrast with e.g. BASIC; i bet everyone would remember BASIC pretty well and the result would be a family of real BASIC dialects, not just vaguely related new languages.

If you substitute 'oot' for 'C' here, we would want Oot to be easier to remember than C; more like BASIC.

As noted above, i think this is unlikely to happen in the real world, even if there were an apocalypse, but it's a good thought experiment to push the language to be 'simple' and to think of whether a language 'fits in your head'.

---

http://stackoverflow.com/questions/10858787/what-are-the-uses-for-tags-in-go

---

http://www.geeksforgeeks.org/write-a-function-to-reverse-the-nodes-of-a-linked-list/

---

assertion of fact (opposite: query fact (and then match pattern to multiassign results)) (you could assert an equation rather than a special value assignment, but you could assert a value assignment too) vs command vs assignment (which is like pronouns) and also (although mb not separate from the above) RESTful interactions like GET address, SET (PUT) document=value; CRUD; VERBs applied to NOUNs, possibly with other arguments too (eg the value being assigned to a document in PUT)

---

evaluation strategy relates to variable substitution, but also to time, as variable substitution is an analog of time (the sequence of computation) within the timeless realm of purity (referential transparency)
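
A tiny illustration in Python (names made up): the evaluation strategy decides when the substitution of an argument for a variable actually happens, which is the only notion of 'when' a pure program has. Call-by-value substitutes at the call site; call-by-name (simulated with thunks) substitutes later, or never:

def loud(x):
    print("evaluating", x)
    return x

def const_by_value(a, b):
    return a                      # b was already evaluated at the call site

def const_by_name(a, b):
    return a()                    # b (a thunk) is never forced

const_by_value(loud(1), loud(2))                   # prints both 1 and 2
const_by_name(lambda: loud(1), lambda: loud(2))    # prints only 1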

---

jcrites 2 days ago

The article is discussing documentation for the AWS Flow Framework specifically. The Flow Framework is a Java framework built on top of the SWF API, and it provides a completely different programming model than the SWF API.

The C# example being discussed, as well as the Java example at the end, are examples of using that SWF API directly. The SWF API is indeed simpler for trivial examples. Flow is a power tool that handles complex workflows better than any alternative I've seen, but the framework itself is complex and incurs cost to set up and use. The documentation could do a better job of explaining this, and of providing Java API examples.

The Flow framework provides something that's a mix of Java code and a domain-specific language for building SWF applications that's expressed as Java code. (For an analogy, consider EasyMock.) It's hard to explain Flow concisely, but if I had to try I'd say, "You write code that looks like it's procedural and runs on one machine, and Flow converts that into a distributed workflow running across a fleet of machines, all of which may be stateless". Flow threads the state into and out of SWF for you, making it possible to express distributed workflows at a higher level of abstraction. Some examples are in the AWS Flow Framework Recipes: https://aws.amazon.com/code/2535278400103493

To achieve this, Flow uses fancy code weaving & AOP techniques. This makes it more complicated to set up, develop, and test. Flow pays off however once your workflow is more complex than a simple linear workflow. You could, for example, process data with a distributed map/reduce pattern in a few lines of code in Flow. (Source: built production systems on SWF with and without Flow)

reply

---

" In other words, state machines should not be specified as tuples that connect two states (S1, A, S2) as they traditionally are, they are rather tuples of the form (Sk, Ak1, Ak2,…) that specify all the actions enabled, given a state Sk, with the resulting state being computed after an action has been applied to the system, and the model has processed the updates. "

---

prewett 8 hours ago

I think part of the problem is that MVC is pretty heavyweight. Most UI doesn't need that kind of flexibility, but when you want it, you want it. So you need a way to make it simple most of the time and still have access to the details.

In web development it is probably complicated by the fact that, in my opinion, declarative positioning and sizing of elements is a pipe dream. It looks simple until you try to actually implement it, and HTML/CSS has only a rudimentary implementation. (As far as I know, Motif and Apple's constraints are the only UI toolkits that have a solid implementation) Given what we want to do with the web these days, I think we would be better off with programming the web page declaratively. Something like what Qt does. I've never found an easier way to write a UI than Qt.

reply

---

term rewriting vs. lambda calculus:

http://stackoverflow.com/questions/24330902/how-does-term-rewriting-based-evaluation-work :

" How does term-rewriting based evaluation work?

The Pure programming language is apparently based on term rewriting, instead of the lambda-calculus that traditionally underlies similar-looking languages. ...

The matching of patterns, and substitution into output expressions, superficially looks a bit like syntax-rules to me (or even the humble #define), but the main feature of that is obviously that it happens before rather than during evaluation, whereas Pure is fully dynamic and there is no obvious phase separation in its evaluation system (and in fact otherwise Lisp macro systems have always made a big noise about how they are not different from function application). Being able to manipulate symbolic expression values is cool'n'all, but also seems like an artifact of the dynamic type system rather than something core to the evaluation strategy (pretty sure you could overload operators in Scheme to work on symbolic values; in fact you can even do it in C++ with expression templates).

So what is the mechanical/operational difference between term rewriting (as used by Pure) and traditional function application, as the underlying model of evaluation, when substitution happens in both?

1 Answer

Term rewriting doesn't have to look anything like function application, but languages like Pure emphasise this style because a) beta-reduction is simple to define as a rewrite rule and b) functional programming is a well-understood paradigm.

A counter-example would be a blackboard or tuple-space paradigm, which term-rewriting is also well-suited for.

One practical difference between beta-reduction and full term-rewriting is that rewrite rules can operate on the definition of an expression, rather than just its value. This includes pattern-matching on reducible expressions:

-- Functional style
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)

-- Compose f and g before mapping, to prevent traversing xs twice
result = map (compose f g) xs

-- Term-rewriting style: spot double-maps before they're reduced
map f (map g xs) = map (compose f g) xs
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)

-- All double maps are now automatically fused
result = map f (map g xs)

Notice that we can do this with LISP macros (or C++ templates), since they are a term-rewriting system, but this style blurs LISP's crisp distinction between macros and functions.

CPP's #define isn't equivalent, since it's not safe or hygienic (syntactically-valid programs can become invalid after pre-processing).

...

Another practical consideration is that rewrite rules must be confluent if we want deterministic results, ie. we get the same result regardless of which order we apply the rules in. No algorithm can check this for us (it's undecidable in general) and the search space is far too large for individual tests to tell us much. Instead we must convince ourselves that our system is confluent by some formal or informal proof; one way would be to follow systems which are already known to be confluent.

For example, beta-reduction is known to be confluent (via the Church-Rosser Theorem), so if we write all of our rules in the style of beta-reductions then we can be confident that our rules are confluent. Of course, that's exactly what functional programming languages do!

"

---

on runtime bounds-testing, i assume?:

" It was clear when Hoare stated on his speech in 1980:

"Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to - they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law." "

---

munificent 4 hours ago

> Is there a reason why one of these hasn't emerged/been adopted by the community?

Personally, I believe package management is one of those things that really does need an official blessed solution. Otherwise, you have a nasty bootstrapping problem: if there are ten competing package managers, how do you install them, and how do package developers know which one to put their packages in?

Collection types have the same problem. You basically need to put some collections in a blessed core library, otherwise it's virtually impossible to reliably share code. Any function that wants to return a list ends up having to pick one of N list implementations and which ever one they pick means their library is hard for users of the other N-1 lists to consume.

The Go team hasn't blessed a package manager, I think, because it's not that relevant to them: they mostly live within Google's own infrastructure which obviates the need for something like version management. They probably don't feel the pain acutely and/or might not have the expertise to design one that would work well outside Google.

reply

---

on Parrot's M0 and Lorito:

If you were truly interested in M0, any decent search engine, or even a trawl through one of several Perl 6 and Parrot Links pages, would have taken you to an article written in 2011 by one of the designers and developers of Lorito and M0. That article is Less Magic, Less C, A Faster Parrot, which says:

" The current stage of Lorito is M0, the "zero magic" layer of implementing a handful of operations which provide the language semantics of C without dragging along the C execution model. In other words, it's a language powerful enough to do everything we use C for without actually being C. It offers access to raw memory, basic mathematical operations, and Turing-complete branching while not relying on the C stack and C calling conventions.

This was the core of both the M0 design and Lorito itself. ... the Squeak Slang approach (or the Forth approach or...) that M0 intended " -- http://www.perlmonks.org/?node_id=1048142

http://www.modernperlbooks.com/mt/2011/07/less-magic-less-c-a-faster-parrot.html

(note that the person who wrote that, Chromatic, says "Update: M0 is dead, Parrot is effectively doomed, and the author believes that Rakudo is irrelevant. This post is now a historical curiosity.")

Chromatic repeats here that M0 is dead: "After YAPC 2011, I did spend a little time working on a prototype of a smaller, faster core for Parrot, but that went nowhere." (the link talks about M0 and Lorito)

---

" PIR is a mostly terrible language in which to write a compiler. It's better than C in many ways. That's not high praise in the 21st century. " -- http://www.modernperlbooks.com/mt/2013/02/goodnight-parrot.html

---

http://pmthium.com/2014/10/apw2014/ isn't relevant to me, but i read it anyways, and i learned that (not surprisingly) Perl6 has a bunch of stuff that is the opposite of the 'simplicity' and straightforwardness that i want for Oot. Eg some things auto-flatten lists.

---

what do i mean/what is my strategy for 'simple'? Some notes:

---

some ways of thinking about programming languages, and about how the brain might work:

how does the brain or a programming language implement:

custom program representation
loading in the custom program
control flow
primitive atomic data types
composite data structure types
primitive operations
modules
execution model
routing
memory/state: short term (heap), medium term (main memory), long term (disk)
medium and long term memory allocation and freeing
concurrency: not stepping on each other on shared resources (including IO and memory), ipc, scheduling (lending/sharing computational resources from inactive processes), nondeterminism, IPC sync, what else?
IO
scaling up of system resources

---

from https://github.com/perl6/nqp/blob/master/docs/ops.markdown:

what are these? they look cool:

https://www.google.com/search?client=ubuntu&channel=fs&q=take+last+next+redo+succeed+proceed+warn&ie=utf-8&oe=utf-8

mb look at http://doc.perl6.org/language/control

---

mb not related at all but just in case:

http://eli.thegreenplace.net/2015/calling-back-into-python-from-llvmlite-jited-code/

also should probably check out llvmlite "A lightweight LLVM python binding for writing JIT compilers."

---

http://www.drdobbs.com/architecture-and-design/the-rebol-ios-distributed-filesystem/184405152 http://www.rebol.com/ios-intro.html

---

impl Print for u32 {
    fn print(&self) { println!("{}", self); }
    fn copy(&self) -> Self { *self }
}

---

" A selection of database primitives... S ET O P ERATIO NS ( FO R RID LISTS )  Intersection  Difference  Union S O RTING  Merge Sort H ASH O P ERATIO NS  Integer Hashing  String Hashing  Hash Table Management ... " -- http://dsg.uwaterloo.ca/seminars/notes/2014-15/Lehner.pdf (no need to read that)

---

Transaction-Oriented Architecture (shared-everything) vs. Data-Oriented Architecture (mixed shared-everything & shared-nothing)

-- http://dsg.uwaterloo.ca/seminars/notes/2014-15/Lehner.pdf (no need to read that)

---

some db ops:

CRUD, scan

---

" RDMA has three modes of communication, from fastest to slowest these are:

    One-sided RDMA (CPU bypass) which provides read, write, and two atomic operations fetch_and_add, and compare_and_swap.
    An MPI interface with SEND/RECV verbs, and
    An IP emulation mode that enables socket-based code to be used unmodified"

---

" Next up to I want to highlight hardware transactional memory (HTM) instruction support, available in the x86 instruction set architecture since Haswell as the “Transactional Synchronization Extensions.” It comes in two flavours, a backwards-compatible Hardware Lock Elison (HLE) instruction set, and a more flexible forward-looking Restricted Transactional Memory (RTM) instruction set.

Finally, as we saw yesterday, new persistent memory support is coming to give more control over flushing data from volatile cache into persistent memory. "

---

http://atomthreads.com/

"Atomthreads is a free, lightweight, portable, real-time scheduler for embedded systems."

---

http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ this general pattern applies to at least two things:

---

colin_mccabe 777 days ago

I think a big part of the issue is that in Java and C++, you really need generics a lot more than in Go. Without templates, you would not have any easy way of doing maps and lists in C++. There are no builtin types for those like in Go. The way the type system works in those languages also makes things very difficult if you don't have generics.

Think about the sorting example I wrote earlier: https://news.ycombinator.com/item?id=7080562 If you were writing it in Java, is sort.Sort an interface or an abstract base class? Well, you can only "extend" one class (single inheritance only), so you would probably want Sort to be a Java interface. That means that you would always have to implement all three functions, not just one as I did, since Java interfaces cannot have default implementations. The comments indicate that most posters didn't even consider the idea that you could reuse the StringSlice functions. That method of easy composition simply doesn't exist in Java.

In general, generics get used a lot as a band-aid to avoid multiple inheritance in C++ and Java. You can't (or shouldn't, in C++) have your Foo inherit from both a (non-abstract) Bar and Baz. But you can certainly template on them. In C++, this kind of thing is called "traits" and Alexandrescu wrote a whole book about it. It's also why std::string is actually std::basic_string<char, std::char_traits<char>, std::allocator<char> >. In Go, you don't need all this... you just implement as many interfaces as you like and you're done.

---

" Cyclone implements three kinds of reference (following C terminology these are called pointers):

---

if a function has an inner function that is returned as a closure, then that inner function can access and modify the values of variables in the outer function. Does this mean that those variables that might be modified from the inner function must be prefixed with the '&' sigil? No, because & is only needed for 'non-local' modification, and "non-local" is delineated by lexical scope & threading; so because it's an inner function, it's within the lexical scope, and therefore 'local'. But if this inner function is passed to another thread, then the variable must indeed have a '&', because it is being accessed across threads (alternately we could just disallow that sort of thing and force communication over an explicit channel for that).
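
A sketch of the two situations in Python (standing in for Oot; Python's nonlocal is playing the role of ordinary lexically-scoped mutation here, not of the '&' sigil):

import threading

def make_counter():
    count = 0
    def bump():
        nonlocal count     # inner function mutating an enclosing variable:
        count += 1         # same lexical scope, same thread, so no '&' needed
        return count
    return bump

bump = make_counter()
bump(); bump()

# handing the closure to another thread crosses the thread boundary; under the
# rule above, 'count' would then need the '&' sigil (or this would be disallowed
# in favor of an explicit channel)
t = threading.Thread(target=bump)
t.start(); t.join()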

---

from emacs Info mode:

space for next page, delete for prev page, p for previous chapter, n for next chapter (and what for up and down?)

---

maybe .= for 'is'?

---

http://spacecraft.ssl.umd.edu/akins_laws.html

---

 pron 1 day ago

That also depends on the namespaces used by the language. E.g. auto-completion in Clojure works a lot better than in JS, because every symbol has a unique resolution, known statically (or as unique as in many typed languages). True, you get less helpful suggestions, but OTOH, the error messages can be much clearer than in the typed case. The former can leave error reporting to the DSL, while the latter is restricted to the cryptic types of the host language's compiler. I'm not saying this can't be resolved with some clever compiler plugins, but if we're comparing options available now, the advantage is not so clear-cut (I do favor types, but only as long as they don't get too complex).

reply

---

"

 You mean coding the stuff people are doing in LLVM and GCC in ML on CompCert or similar system? No, it's significantly easier to do that than get current architectures right in C-like languages. FOSS just doesnt do it for most part. Rust was exception: did theirs in Ocaml.

After FOSS compiler types build it, users can get the reproducible source and build the tool. Then that builds the other apps from source. See how easy that is?

Note: Wirth et al built a safe language, simple ASM, CPU, OS, apps, and all with a few people in a few years. The ASM-3GL-Compiler build is WAY easier than you think. It bootstraps faster one after. .... That's a large problem. The compiler part is smaller with lots of work in CompSci, FOSS, and private sector (eg books). There's tools with source available on net for imperative and functional languages that are safer, too. Ignored almost entirely by safety or security oriented projects in compilers and general FOSS in favor of harder-to-analyze, less secure stuff. However, they'll happily bring up fad-driven stuff like Thompson attack or reproducible builds as The Solution.

Here's the actual solution. You start with a simple, non-optimizing toolchain designed and documented for easy understanding and implementation. It has extensive test suite. User worried about subversion implements that in tooling of their choice on own machine. Wirth simplified it with P-code interpreter that was easy to implement with compiler and apps targeting it. Once first compiler is done, it compiles the HLL source of its own code. Now, you can use it to compile a high-performance compiler's source or add optimizations to it. Most of this work is done so it's a matter of FOSS compiler types or project teams just integrating and using it. ...

In 70's-80's, people designed, assured, and pentested guards with great results. Firewalls were a watered down version that came later with features but not assurance. Push guards on firewall proponents, even developers, then you'll just get ignored. They will work on whatever is making rounds on favorite IT or INFOSEC sites, though.

Compiler and OS people. They usually write their stuff in a monolithic style in C despite decades of bad results that way. Showing even one person (Edison), three (Lilith/Oberon), or handful (MINIX 3) can do entire system safer with less people and time will not change this. Showing them ML or something with C compilation for portability will not change this. They systematically reject this while doing whatever is their tradition or becomes in the vogue.

... " -- nickpsecurity

 nickpsecurity 202 days ago | parent

This is an old problem solved dozens of ways that mainstream just refuses to deal with. The requirement is even standard for proprietary products going for DO-178B certification. I believe they do quite manual confirmation but automated exists. The solution is called certified compilation: the verifiable transformation of source into binaries. You break the process into a series of steps which each can be verified with the CST's/AST's handed from one to the next. You can implement the steps yourself or validate someone else's, even easier if it's a safe[r] language. Examples each using different methods are VLISP [1], FLINT [2], and CompCert C [3].

Running Debian through CompCert while putting more work into CompCert for portability and optimization is the easiest solution with long-term benefits. Performance will go up steadily. Bug count will go down steadily because that's what SML/Ocaml does. Code will be more readable. Repeat for most trusted tools to drive assurance up across the board.

If they don't want to do that, then the result will be something along lines of just having a bunch of people compile and sign the distro publishing signatures, etc. You will trust that they trusted whatever they all looked at. And anyone who's studied GCC's source, etc will know that basically means they all saw the same code. They'd have to understand it all to have known if there was a weakness introduced. They won't, use of C/C++ makes that harder, plenty of rope to hang one's self in any common action, and it's why those of us doing subversion-resistant development use languages like ML's or Oberon. FOSS needs to similarly transition toward safe, comprehensible tools that aren't backdoor generators just by architecture & language used.

Otherwise, all this talk of preventing subversion is just talk: they're going to get in. And if not subversion, the endless stream of 0-day's from the language and architectural choices will continue to do the job. A re-implementation of the TCB's of our systems is long overdue.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.2824&rep=rep1&type=pdf

[2] http://flint.cs.yale.edu/flint/software.html

[3] http://compcert.inria.fr/

...

Here's what subversion resistant development takes: modular software with sensible interfaces; ability to understand code for human review (closer to algorithm the better); ability to understand compiler passes in isolation; ability to implement toolchain in language of choosing. There are existing flows like this as I illustrated. So, you use them and leverage diverse audience to check results. I mean, would you rather implement CompCert passes by hand or GCC even without optimizations? See the difference? ;)

Now, I did have a method to solve problem you're addressing. You implement an assembler first. Then a macro assembler with macro's for HLL primitives. You can use that immediately to implement a certified compiler. Alternately, you can pick up Oberon report or Scheme book to implement that to get a true HLL plus compiler. Then you implement the certified compiler with it. Comprehension, code complexity, and trust are kept manageable by building layer by layer. This, for productivity not security, is how Wirth and Carl first built Lilith then Oberon. Same method will work again and good that ML/Scheme/Oberon folks already gave us doc's plus code to use. Let's use them.

jeffreyrogers 201 days ago

Yep, building up like that would work. Oberon if I recall correctly is pretty simple too (maybe ~20k LOC?) so that would actually be possible by a small team.

nickpsecurity 201 days ago

It has many times. Nice, LISP-style example that was recently on HN:

https://speakerdeck.com/nineties/creating-a-language-using-only-assembly-language

Note: LISP/Scheme interpreters and processors with plenty of detail (including source) can be found with Google. Many implemented before 1990. Will run on cheap FPGA's or process nodes. Can take it all the way to hardware. ;)

The macro ASM can be built on something like P-code: an idealized, low-level machine easy to deploy on CISC and RISC architectures. A good example of how to bridge ASM and HLL's is Hyde's High Level Assembly:

http://www.plantation-productions.com/Webster/

The HLL, for non-LISP audience, can be Oberon with aid of Wirth's Compiler Construction book among other papers:

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

So, many possibilities. People just gotta use them. Build a LISP w/ macros to build everything else still seems to be easiest strategy. Esp as one can reuse code from textbooks unlikely to be subverted. Wirth's next best.

nickpsecurity 149 days ago

parent

That's not going to happen because the two toolchains won't be anything alike. You won't be able to make them alike either. Worse, that people often quote Thompson's attack shows that INFOSEC teaches subversion very poorly: it's the least likely attack to affect you and you really need to counter the others. What you have to do is make the software correct, make it secure, and ensure no subversion in lifecycle. That's called high assurance or robustness system design. Links below show you how to do that.

High assurance software design - nice intro http://web.cecs.pdx.edu/~hook/cs491sp08/AssuranceSp08.ppt

FOSS tools for high assurance http://www.dwheeler.com/essays/high-assurance-floss.html

Certified compilation (HW w/ need certified "synthesis") http://compcert.inria.fr/

Original work on subversion http://csrc.nist.gov/publications/history/myer80.pdf

Example of it in action http://www.cisr.us/downloads/theses/02thesis_anderson.pdf

Example of high assurance hardware (AAMP7G version is secure but no public paper...) http://www.csl.sri.com/papers/wift95/wift95.pdf

---

http://stackoverflow.com/questions/367115/is-there-a-python-equivalent-to-perl-pi-e

http://everythingsysadmin.com/perl2python.html

http://www.softpanorama.org/Scripting/Perlorama/perl_in_command_line.shtml

http://programmers.stackexchange.com/questions/65150/is-there-any-good-reason-for-someone-who-knows-python-to-learn-perl

---

mb resurrect list-context sigils (@ vs $) from perl5 so that when a list is prefixed with '$' it is treated as usual (a value that happens to be a list), whereas when it is in @ form it is implicitly mapped over (note: this isn't quite how Perl5 does it). But didn't Perl6 get rid of this? Or did they?
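
Roughly the intended semantics, sketched in Python rather than Perl or Oot (the helper name implicit_map is made up):

xs = [1, 2, 3]

# '$xs' form: the list is handled as a single value
as_value = len(xs)                                   # 3

# '@xs' form: the surrounding operation is implicitly mapped over the elements
def implicit_map(f, items):
    return [f(x) for x in items]

as_mapped = implicit_map(lambda x: x + 1, xs)        # [2, 3, 4]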

---

" Common practice in network monitoring and in QoS? technologies is to identify a flow of packets by the 5-tuple {source address, dest address, source port, dest port, protocol #}. " -- https://www.ietf.org/mail-archive/web/ipv6/current/msg11559.html

---

"This is relatively trivial at line speed in IPv4 since these things are at fixed locations in the header. But in IPv6, the protocol number is at the end of a linked list of "next headers." .. From today's perspective, the IPv6 header design is complete crap. Maybe it was optimized for software forwarding on in-order CPUs, but that's distant history now. "

-- https://www.ietf.org/mail-archive/web/ipv6/current/msg11559.html and https://www.ietf.org/mail-archive/web/ipv6/current/msg11681.html

---

IP packets have a 'protocol number'. It is a 1-byte field. TCP and UDP and ICMP are some of the choices.

On top of TCP/IP are various other protocols, like telnet, SMTP, HTTP, SSH, etc. They tend to be associated with 'well-known ports'. The first 1024 ports are reserved as 'system ports' or 'well known ports', assigned by IANA. From 1024 to 49151 are 'user ports' or 'registered ports' which are also assigned by IANA, presumably much more easily ([1] says about the system ports, "The requirements for new assignments in this range are stricter than for other registrations.[2]"). There are currently some games registered in both the system ports (eg Doom at 666) and the user ports. From 49152-65535 are the 'ephemeral' or dynamic or private ports.

what are the common IP protocols?

http://www.cisco.com/c/en/us/td/docs/security/asa/asa90/configuration/guide/asa_90_cli_config.pdf lists ('Possible ASA protocol literal values'):

ahp, eigrp, esp, gre, icmp, igmp, igrp, ip, ipinip, ipsec, nos, ospf, pcp, snp, tcp, and udp

http://www.networking-forum.com/viewtopic.php?f=46&t=15498 lists:

TCP UDP mb DCCP SCTP RSVP
ICMP - Used by pings, and for other network-level information.
IGMP - Used with multicasting to join/leave groups.
RSVP - Used to reserve bandwidth for a flow. Not very common, but discussed in almost all QOS texts.
GRE - Used for tunneling.

https://technet.microsoft.com/en-us/library/cc959827.aspx?f=255&MSPPError=-2147217396 lists:

1 Internet Control Message Protocol (ICMP)
6 Transmission Control Protocol (TCP)
17 User Datagram Protocol (UDP)
47 General Routing Encapsulation (PPTP data over GRE)
51 Authentication Header (AH) IPSec
50 Encapsulation Security Payload (ESP) IPSec
8 Exterior Gateway Protocol (EGP)
3 Gateway-Gateway Protocol (GGP)
20 Host Monitoring Protocol (HMP)
88 Internet Group Management Protocol (IGMP)
66 MIT Remote Virtual Disk (RVD)
89 OSPF Open Shortest Path First
12 PARC Universal Packet Protocol (PUP)
27 Reliable Datagram Protocol (RDP)
46 Reservation Protocol (RSVP) QoS

http://networkengineering.stackexchange.com/a/16212 (answer to http://networkengineering.stackexchange.com/questions/16191/raw-ip-communication ) says:

"

List of common IP protocols:

Nearly every application I can think of uses one of the following IPv4 protocols (in rough order of frequency people see on the wire):

    TCP (IP Protocol 6)
    UDP (IP Protocol 17)
    ICMP (IP Protocol 1)
    OSPF (IP Protocol 89)
    EIGRP (IP Protocol 88)
    GRE (IP Protocol 47)
    ESP (IP Protocol 50)
    AH (IP Protocol 51)
    PIM (IP Protocol 103)
    IGMP (IP Protocol 2)
    VRRP (IP Protocol 112)
    SCTP (IP Protocol 132)

That's it... statistically, every other IP protocol ranks in noise... and if it wasn't obvious enough, every one of those protocols performs at least one of the functions of an IP Transport protocol. "

http://etherape.sourceforge.net/introduction.html lists (this is more than just IP):

ETH_II, 802.2, 803.3, IP, IPv6, ARP, X25L3, REVARP, ATALK, AARP, IPX, VINES, TRAIN, LOOP, VLAN, ICMP, IGMP, GGP, IPIP, TCP, EGP, PUP, UDP, IDP, TP, ROUTING, RSVP, GRE, ESP, AH, EON, VINES, EIGRP, OSPF, ENCAP, PIM, IPCOMP, VRRP

http://teleco-network.blogspot.com/2012_11_01_archive.htmll lists:

AH, ARP/RARP, ATMP, BGMP, BGP-4, COPS, DCAP, DHCP, DHCPv6, DNS, DVMRP, EGP, EIGRP, ESP, FANP, Finger, FTP, GOPHER, GRE, HSRP, HTTP, ICMP, ICMPv6, ICP, ICPv2, IDRP, IGMP, IGRP, IMAP4, IMPP, IP, IPv6, IPDC, IRC, L2F, L2TP, LDAP, LDP, MARS, MDTP, Megaco (ASCII + ASN.1), Mobile IP, MZAP, NARP, Nat, NetBIOS/IP, NHRP, NTP, OSPF, PIM, POP3, PPTP, Radius, RIP2, RIPng for IPv6, RSVP, RTSP, RUDP, SCSP, SCTP, SDCP, SLP, SMPP, SSH, SMTP, SNMP, SOCKS, TACACS+, TCP, TELNET, TFTP, TRIP, UDP, Van Jacobson, VRRP, WCCP, XOT, X-Window.

what are the common ports/application layer protocols?

[2] lists some well-known examples of system port applications:

    21: File Transfer Protocol (FTP)
    22: Secure Shell (SSH)
    23: Telnet remote login service
    25: Simple Mail Transfer Protocol (SMTP)
    53: Domain Name System (DNS) service
    80: Hypertext Transfer Protocol (HTTP) used in the World Wide Web
    110: Post Office Protocol (POP3)
    119: Network News Transfer Protocol (NNTP)
    123: Network Time Protocol (NTP)
    143: Internet Message Access Protocol (IMAP)
    161: Simple Network Management Protocol (SNMP)
    194: Internet Relay Chat (IRC)
    443: HTTP Secure (HTTPS)

http://brainmeta.com/forum/index.php?showtopic=9800 lists:

20 FTP data (File Transfer Protocol)
21 FTP (File Transfer Protocol)
22 SSH (Secure Shell)
23 Telnet
25 SMTP (Send Mail Transfer Protocol)
43 whois
53 DNS (Domain Name Service)
68 DHCP (Dynamic Host Control Protocol)
79 Finger
80 HTTP (HyperText Transfer Protocol)
110 POP3 (Post Office Protocol, version 3)
115 SFTP (Secure File Transfer Protocol)
119 NNTP (Network New Transfer Protocol)
123 NTP (Network Time Protocol)
137 NetBIOS-ns
138 NetBIOS-dgm
139 NetBIOS
143 IMAP (Internet Message Access Protocol)
161 SNMP (Simple Network Management Protocol)
194 IRC (Internet Relay Chat)
220 IMAP3 (Internet Message Access Protocol 3)
389 LDAP (Lightweight Directory Access Protocol)
443 SSL (Secure Socket Layer)
445 SMB (NetBIOS over TCP)
666 Doom
993 SIMAP (Secure Internet Message Access Protocol)
995 SPOP (Secure Post Office Protocol)

https://en.wikibooks.org/wiki/Network_Plus_Certification/Technologies/Common_Protocols some protocols by layer, including some non-TCP-based application layer protocols (eg RTP which is usually over UDP):

Application: DNS, TFTP, TLS/SSL, FTP, HTTP, IMAP4, POP3, SIP, SMTP, SNMP, SSH, Telnet, RTP
Transport: TCP, UDP
Internet: IP (IPv4, IPv6), ICMP, IGMP
Link: ARP

http://www.answersthatwork.com/Download_Area/ATW_Library/Networking/Network__2-List_of_Common_TCPIP_port_numbers.pdf lists:

ftp ssh telnet smtp wins_replication whois dns dhcp finger http x.400 pop3 sftp nntp ntp rpc_locator_service netbios_name_service_wins imap4 snmp BGP SRS LDAP SSL SMB gmail_outgoing ldap_over_ssl ssl_imap gmail_pop3

http://www.linuxnix.com/important-port-numbers-linux-system-administrator/ lists:

20 – FTP Data (For transferring FTP data)
21 – FTP Control (For starting FTP connection)
22 – SSH (For secure remote administration which uses SSL to encrypt the transmission)
23 – Telnet (For insecure remote administration)
25 – SMTP (Mail Transfer Agent for e-mail server such as SEND mail)
53 – DNS (Special service which uses both TCP and UDP)
67 – Bootp
68 – DHCP
69 – TFTP (Trivial file transfer protocol uses udp protocol for connection less transmission of data)
80 – HTTP/WWW (apache)
88 – Kerberos
110 – POP3 (Mail delivery Agent)
123 – NTP (Network time protocol used for time syncing, uses UDP protocol)
137 – NetBIOS (nmbd)
139 – SMB-Samba (smbd)
143 – IMAP
161 – SNMP (For network monitoring)
389 – LDAP (For centralized administration)
443 – HTTPS (HTTP+SSL for secure web access)
514 – Syslogd (udp port)
636 – ldaps (both tcp and udp)
873 – rsync
989 – FTPS-data
990 – FTPS
993 – IMAPS

http://web.mit.edu/rhel-doc/4/RH-DOCS/rhel-sg-en-4/ch-ports.html lists:

http://www.techexams.net/forums/network/2235-tcp-ip-port-assignments-print.html lists:

 FTP 21, SSH 22, Telnet 23, SMTP 25, DNS 53, TFTP 69, HTTP 80, POP3 110, NNTP 119, NTP 123, IMAP4 143, SNMP 161, HTTPS 443....also gopher finger 

http://www.sei.cmu.edu/reports/12tr006.pdf says: " The top expected services requested by clients on a typical network are web (ports 80 and 443), DNS (53), and SMTP (25) "

http://www.internet-computer-security.com/Firewall/Protocols/Ports-Protocols-IP-Addresses.html lists:

20 FTP (File Transfer Protocol) – Data Port
21 FTP (File Transfer Protocol) – Command Port
22 SSH (Secure Shell) - Used for secure remote access
23 Telnet – Used for insecure remote access, data sent in clear text
25 SMTP (Simple Mail Transport Protocol) – Used to send email
53 DNS (Domain Name Service) – Used to resolve DNS names to public IP addresses
68 DHCP (Dynamic Host Configuration Protocol) – Used to assign IP addresses to clients
80 HTTP (Hypertext Transfer Protocol) - Used to browse the web
110 POP3 (Post Office Protocol, version 3) - Used to retrieve email from a server
115 SFTP (Secure File Transfer Protocol) - Secure file transfer
119 NNTP (Network News Transfer Protocol) – For transferring news articles between news servers
123 NTP (Network Time Protocol) - For synchronising system time with a time server on the public network
161 SNMP (Simple Network Management Protocol) - For receiving system management alerts
163 IMAP (Internet Message Access Protocol 4) - For retrieving emails
389 LDAP (Lightweight Directory Access Protocol) - Querying directory services such as Active Directory
443 SSL (Secure Socket Layer) - Using a secure web connection
445 SMB (Server Message Block) - For shared access to files and printers

http://www.linuxsecurity.com/resource_files/firewalls/firewall-seen.html lists "common incoming TCP/UDP probes against my firewall":

1 tcpmux Indicates someone searching for SGI Irix machines. Irix is the only major vendor that has implemented tcpmux, and it is enabled by default on Irix machines. Irix machines ship with several default passwordless accounts, such as lp, guest, uucp, nuucp, demos, tutor, diag, EZsetup, OutOfBox, and 4Dgifts. Many administrators forget to close these accounts after installation. Therefore, hackers scan the Internet looking first for tcpmux, then these accounts. [CA-95.15]

7 Echo You will see lots of these from people looking for fraggle amplifiers sent to addresses of x.x.x.0 and x.x.x.255.

A common DoS attack is an echo-loop, where the attacker forges a UDP packet from one machine and sends it to the other, then both machines bounce packets off each other as fast as they can (see also chargen). [CA-96.01]

Another common thing seen is TCP connections to this port by DoubleClick. They use a product called "Resonate Global Dispatch" that connects to this port on DNS servers in order to locate the closest one.

Harvest/squid caches will send UDP echoes from port 3130. To quote: If the cache is configured with source_ping on, it also bounces a HIT reply off the original host's UDP echo port. It can generate a lot of these packets.

11 sysstat This is a UNIX service that will list all the running processes on a machine and who started them. This gives an intruder a huge amount of information that might be used to compromise the machine, such as indicating programs with known vulnerabilities or user accounts. It is similar to the contents that can be displayed with the UNIX "ps" command. ICMP doesn't have ports; if you see something that says "ICMP port 11", you probably want ICMP type=11.

19 chargen This is a service that simply spits out characters. The UDP version will respond with a packet containing garbage characters whenever a UDP packet is received. On a TCP connection, it spits out a stream of garbage characters until the connection is closed. Hackers can take advantage of IP spoofing for denial of service attacks. Forging UDP packets between two chargen servers, or a chargen and echo, can overload links as the two servers attempt to infinitely bounce the traffic back and forth. Likewise, the "fraggle" DoS attack broadcasts a packet destined to this port with a forged victim address, and the victim gets overloaded with all the responses. [CA-96.01]

21 FTP The most common attack you will see is hackers/crackers looking for "open anonymous" FTP servers. These are servers with directories that can be written to and read from. Hackers/crackers use these machines as way-points for transferring warez (pirated programs) and pr0n (intentionally misspelled word to avoid search engines classifying this document).

22 ssh pcAnywhere TCP connections to this port might indicate a search for ssh, which has a few exploitable features. Many versions using the RSAREF library can be exploited if they are configured in a certain fashion. (Suggestion: run ssh on some other port).

Also note that the ssh package comes with a program called make-ssh-known-hosts that will scan a domain for ssh hosts. You will sometimes be scanned from innocent people running this utility.

UDP (rather than TCP) packets directed at this port along with port 5632 indicate a scan for pcAnywhere. The number 5632 is (hex) 0x1600, which byte-swapped is 0x0016, which is 22 decimal.

23 Telnet The intruder is looking for a remote login to UNIX. Most of the time intruders scan for this port simply to find out more about what operating system is being used. In addition, if the intruder finds passwords using some other technique, they will try the passwords here.

25 SMTP Spammers are looking for SMTP servers that allow them to "relay" spam. Since spammers keep getting their accounts shut down, they use dial-ups to connect to high bandwidth e-mail servers, and then send a single message to the relay with multiple addresses. The relay then forwards to all the victims. SMTP servers (esp. sendmail) are one of the favorite ways to break into systems because they must be exposed to the Internet as a whole and e-mail routing is complex (complexity + exposure = vulnerability).

53 DNS Hackers/crackers may be attempting to do zone transfers (TCP), to spoof DNS (UDP), or even hide other traffic since port 53 is frequently neither filtered nor logged by firewalls.

An important thing to note is that you will frequently see port 53 used as the source UDP port. Stateless firewalls frequently allow such traffic on the assumption that it is a response to a DNS query. Hackers are increasingly exploiting this to pierce firewalls.

67 and 68 bootp DHCP Bootp/DHCP over UDP. Firewalls hooked to DSL and cable-modem lines see a ton of these sent to the broadcast address 255.255.255.255. These machines are asking for an address assignment from a DHCP server. You could probably hack into them by giving them such an assignment and specifying yourself as the local router, then execute a wide range of man-in-the-middle attacks. The client broadcasts its request to port 67 (bootps); the server broadcasts the response back to port 68 (bootpc). The response uses some type of broadcast because the client doesn't yet have an IP address that can be sent to.

69 TFTP (over UDP). Many servers support this protocol in conjunction with BOOTP in order to download boot code to the system. However, they are frequently misconfigured to provide any file from the system, such as password files. They can also be used to write files to the system.

79 finger Hackers are trying to:

    discover user information
    fingerprint the operating system
    exploit known buffer-overflow bugs
    bounce finger scans through your machine to other machines. 

98 linuxconf The utility "linuxconf" provides easy administration of Linux boxen. It includes a web-enabled interface at port 98 through an integrated HTTP server. It has had a number of security issues. Some versions are setuid root, trust the local network, create world-accessible files in /tmp, and have a buffer overflow in the LANG environment variable. Also, because it contains an integrated web server, it may be vulnerable to many of the typical HTTP exploits (buffer overruns, directory traversal using ../.., etc.).

109 POP2 POP2 is not nearly as popular as POP3 (see below), but many servers support both (for backwards compatibility). Many of the holes that can be exploited on POP3 can also be exploited via the POP2 port on the same server.

110 POP3 POP3 is used by clients accessing e-mail on their servers. POP3 services have many well-known vulnerabilities. At least 20 implementations are vulnerable to a buffer overflow in the username or password exchange (meaning that hackers can break in at this stage before really logging in). There are other buffer overflows that can be executed after successfully logging in.

111 sunrpc portmap rpcbind Sun RPC PortMapper/RPCBIND. Access to portmapper is the first step in scanning a system looking for all the RPC services enabled, such as rpc.mountd, NFS, rpc.statd, rpc.csmd, rpc.ttybd, amd, etc. If the intruder finds the appropriate service enabled, s/he will then run an exploit against the port where the service is running.

Note that by putting a logging daemon, IDS, or sniffer on the wire, you can find out what programs the intruder is attempting to access in order to figure out exactly what is going on.

113 identd auth This is a protocol that runs on many machines that identifies the user of a TCP connection. In standard usage this reveals a LOT of information about a machine that hackers can exploit. However, it is used for logging by a lot of services, especially POP, IMAP, SMTP, and IRC servers. In general, if you have any clients accessing these services through a firewall, you will see incoming connection attempts on this port. Note that if you block this port, clients will perceive slow connections to e-mail servers on the other side of the firewall. Many firewalls support sending back a RST on the TCP connection as part of the blocking procedure, which will stop these slow connections.

119 NNTP news Network News Transfer Protocol, carries USENET traffic. This is the port used when you have a URL like news://comp.security.firewalls. Attempts on this port are usually by people hunting for open USENET servers. Most ISPs restrict access to their news servers to only their customers. Open news servers allow posting and reading from anybody, and are used to access newsgroups blocked by someone's ISP, to post anonymously, or to post spam.

Update: @Home has started scanning their subscribers to see if they are running USENET servers. They are doing this in order to find these servers and close them before spammers can take advantage of them.

135 loc-serv MS RPC end-point mapper Microsoft runs its DCE RPC end-point mapper for its DCOM services at this port.

This has much the same functionality as port 111 for UNIX systems. Services that use DCOM and/or RPC register their location with the end-point mapper on the machine. When clients remotely connect to the machine, they query the end-point mapper to find out where the service is. Likewise, hackers can scan the machine on this port in order to find out such things as "is Exchange Server running on this machine, and which version?".

This port is often hit in order to scan for services (for example, using the "epdump" utility), but this port may also be attacked directly. Currently, there are a few denial-of-service attacks that can be directed at this port.

137 NetBIOS name service nbtstat (UDP) This is the most common item seen by firewall administrators and is perfectly normal. Please read the NetBIOS section below for more details.

139 NetBIOS File and Print Sharing Incoming connections to this port are trying to reach NetBIOS/SMB, the protocols used for Windows "File and Print Sharing" as well as SAMBA. People sharing their hard disks on this port are probably the most common vulnerability on the Internet.

Attempts on this port were common at the beginning of 1999, but tapered off near the end. Now at the start of year 2000, attempts on this port have picked up again. Several VBS (IE5 VisualBasic Scripting) worms have appeared that attempt to copy themselves on this port. Therefore, it may be worms attempting to propagate on this port.

143 IMAP4 Same security idea as POP3 above; numerous IMAP servers have buffer overflows that allow compromise during the login. Note that for a while, there was a Linux worm (admw0rm) that would spread by compromising port 143, so a lot of scans on this port are actually from innocent people who have already been compromised. IMAP exploits became popular when Red Hat enabled the service by default on its distributions. In fact, this may have been the first widely scanned-for exploit since the Morris Worm.

This port is also used for IMAP2, but that version wasn't very popular.

Several people have noted attacks from port 0 to port 143, which appears to be from some attack script.

161 SNMP (UDP) A very common port that intruders probe for. SNMP allows for remote management of devices. All the configuration and performance information is stored in a database that can be retrieved or set via SNMP. Many managers mistakenly leave this available on the Internet. Crackers will first attempt to use the default passwords "public" and "private" to access the system; they may then attempt to "crack" the password by trying all combinations.

SNMP packets may be mistakenly directed at your network. Windows machines running HP JetDirect remote management software use SNMP, and misconfigured machines are frequent. HP OBJECT IDENTIFIERs will be seen in the packets. Newer versions of Win98 will use SNMP for name resolution; you will see packets broadcast on local subnets (cable modem, DSL) looking up sysName and other info.

162 SNMP trap Probably a misconfiguration.

177 xdmcp Numerous hacks may allow access to an X-Window console; it needs port 6000 open as well in order to really succeed.

513 rwho Probably from UNIX machines on your DSL/cable-modem segment broadcasting who is logged into their servers. These people are kindly giving you really interesting information that you can use to hack into their systems.

535 CORBA IIOP (UDP) If you are on a cable-modem or DSL VLAN, then you may see broadcasts to this port. CORBA is an object-oriented remote procedure call (RPC) system. It is highly likely that when you see these broadcasts, you can use the information to hack back into the systems generating these broadcasts.

600 pcserver backdoor See port 1524 for more info.

Some script kiddies feel they're contributing substantially to the exploit programs by making a minor change from ingreslock to pcserver in constant text... -- Alan J. Rosenthal.

635 mountd Linux mountd bug. This is a popular bug that people are scanning for. Most scans on this port are UDP-based, but they are increasingly TCP-based (mountd runs on both ports simultaneously). Note that mountd can run at any port (for which you must first do a portmap lookup at port 111), it's just that Linux defaulted to port 635 in much the same way that NFS universally runs at port 2049.

http://www.techotopia.com/index.php?title=Primary_TCP/IP_Port_Assignments_and_Descriptions&mobileaction=toggle_view_mobile lists:

ftp ssh telnet smtp dns tftp http pop3 nntp ntp imap4 snmp https (and nfs, which is at user port 2049)

https://www.digitalocean.com/community/tutorials/how-to-use-nmap-to-scan-for-open-ports-on-your-vps lists:

    20: FTP data
    21: FTP control port
    22: SSH
    23: Telnet <= Insecure, not recommended for most uses
    25: SMTP
    43: WHOIS protocol
    53: DNS services
    67: DHCP server port
    68: DHCP client port
    80: HTTP traffic <= Normal web traffic
    110: POP3 mail port
    113: Ident authentication services on IRC networks
    143: IMAP mail port
    161: SNMP
    194: IRC
    389: LDAP port
    443: HTTPS <= Secure web traffic
    587: SMTP <= message submission port
    631: CUPS printing daemon port

http://teleco-network.blogspot.com/2012_11_01_archive.html lists:

HTTP - Hypertext Transfer Protocol, TCP port 80 (application layer)
SSL - Secure Sockets Layer, TCP port 443
SMTP - TCP port 25. Files stored in LocalDrive:\Inetpub\Mailroot
SNMP - Simple Network Management Protocol, used to provide information about TCP/IP hosts, UDP port 161
FTP - only basic authentication allowed, TCP port 20 (data), TCP port 21 (control). Files stored in LocalDrive:\Inetpub\Ftproot (application layer)
POP - TCP port 110
DNS - UDP port 53 (query), TCP port 53 (zone transfer)
NNTP - TCP port 119. Files stored in LocalDrive:\Inetpub\Nntpfile\Root
PPTP - Point-to-Point Tunneling Protocol, TCP port 1723; protocol number 47
L2TP/IPSec - UDP ports 500, 1701 and 4500; protocol number 50

http://etherape.sourceforge.net/introduction.html lists:

TELNET, FTP, HTTP, POP3, NNTP, NETBIOS, IRC, DOMAIN, SNMP

http://www.tcpipguide.com/free/t_CommonTCPIPApplicationsandAssignedWellKnownandRegi-2.htm lists:

Port #, TCP / UDP, Keyword, protocol abbrev, protocol

7 TCP + UDP echo — Echo Protocol
9 TCP + UDP discard — Discard Protocol
11 TCP + UDP systat — Active Users Protocol
13 TCP + UDP daytime — Daytime Protocol
17 TCP + UDP qotd QOTD Quote Of The Day Protocol
19 TCP + UDP chargen — Character Generator Protocol
20 TCP ftp-data FTP (data) File Transfer Protocol (default data port)
21 TCP ftp FTP (control) File Transfer Protocol (control / commands)
23 TCP telnet — Telnet Protocol
25 TCP smtp SMTP Simple Mail Transfer Protocol
37 TCP + UDP time — Time Protocol
43 TCP nicname — Whois Protocol (also called “Nicname”)
53 TCP + UDP domain DNS Domain Name Server (Domain Name System)
67 UDP bootps BOOTP / DHCP Bootstrap Protocol / Dynamic Host Configuration Protocol (Server)
68 UDP bootpc BOOTP / DHCP Bootstrap Protocol / Dynamic Host Configuration Protocol (Client)
69 UDP tftp TFTP Trivial File Transfer Protocol
70 TCP gopher — Gopher Protocol
79 TCP finger — Finger User Information Protocol
80 TCP http HTTP Hypertext Transfer Protocol (World Wide Web)
110 TCP pop3 POP Post Office Protocol (version 3)
119 TCP nntp NNTP Network News Transfer Protocol
123 UDP ntp NTP Network Time Protocol
137 TCP + UDP netbios-ns — NetBIOS (Name Service)
138 UDP netbios-dgm — NetBIOS (Datagram Service)
139 TCP netbios-ssn — NetBIOS (Session Service)
143 TCP imap IMAP Internet Message Access Protocol
161 UDP snmp SNMP Simple Network Management Protocol
162 UDP snmptrap SNMP Simple Network Management Protocol (Trap)
179 TCP bgp BGP Border Gateway Protocol
194 TCP irc IRC Internet Relay Chat
443 TCP https HTTP over SSL Hypertext Transfer Protocol over Secure Sockets Layer
500 UDP isakmp IKE IPSec Internet Key Exchange
520 UDP router RIP Routing Information Protocol (RIP-1 and RIP-2)
521 UDP ripng RIPng Routing Information Protocol - “Next Generation”

and in user ports:

1512 TCP + UDP wins WINS Microsoft Windows Internet Naming Service 1701 UDP l2tp L2TP Layer Two Tunneling Protocol 1723 TCP pptp PPTP Point-To-Point Tunneling Protocol 2049 TCP + UDP nfs NFS Network File System 6000 - 6063 TCP x11 X11 X Window System
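
As a quick cross-check on lists like the ones above, the operating system's own services database can be queried from code; this is a minimal Python sketch using only the standard library (the handful of ports below is just an illustrative subset of the lists in this note, not an authoritative selection):

    import socket

    # Illustrative subset of the well-known ports cited above.
    ports = [20, 21, 22, 23, 25, 53, 80, 110, 119, 123, 143, 161, 443]

    for port in ports:
        try:
            # Looks the port up in the local services database (e.g. /etc/services).
            name = socket.getservbyport(port, "tcp")
        except OSError:
            name = "(not registered locally)"
        print("%5d/tcp  %s" % (port, name))

    # The reverse lookup also exists:
    print(socket.getservbyname("http", "tcp"))  # typically prints 80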


http://cointelegraph.com/news/tau-chain-a-decentralized-app-store-with-greater-flexibility-than-ethereum

---

http://amznlabs.github.io/ion-docs/index.html https://github.com/amznlabs/ion-java

leef 14 hours ago

Finally! I've had to live the JSON nightmare since I left Amazon.

Some of the benefits over JSON:

reply

efaref 6 hours ago

You could have used CBOR for many of those things (http://cbor.io/).

reply

conradev 14 hours ago

Sounds a lot like Apple's property list format, which shares almost everything you listed in common, except for annotations and symbol tables.

Its binary format was introduced in 2002!

Edit: Property lists only support integers up to 128 bits in size and double-precision floating point numbers. On top of those, Ion also supports infinite precision decimals.

reply

JonathonW 11 hours ago

Plists are nifty, but the text format's XML-based, which makes it too complex and too verbose to be a general-purpose alternative to something like JSON.

(plutil "supports" a json format, but it's not capable of expressing the complete feature set of the XML or binary formats.)

reply

pjmlp 10 hours ago

I don't get this gripe with XML, it is meant to be used by tools not to be written by hand.

Where is the XPath and XQuery for JSON?

Do people really think that manually iterating over the whole JSON document to find the data or writing yet another parser, is better?

reply

jon-wood 9 hours ago

http://jmespath.org/

reply

pjmlp 9 hours ago

Any solid Java, C#, C++ libraries?

reply

bct 3 hours ago

The XML serialization Apple defined for plists is awful and not easy to query with XPath.

reply

jonhohle 12 hours ago

Like Property Lists the binary format is TLV encoded as well. Ion has a more compact binary representation for the same data and additional types and metadata. Also, IIRC, Plist types are limited to 32-bit lengths for all data types. The binary Ion representation has no such restriction (though in practice sizes are often limited by the language implementation).

reply

kazinator 11 hours ago

I Consider this Harmful (TM) and will oppose the adoption in every organization where I have an opportunity to voice such. (In its present form, to be clear!)

There is no need to have a null which is fragmented into null.timestamp, null.string and whatever. It will complicate processing. Even if you know the type of some element is timestamp, you still have to worry about whether or not it is null and what that means.

There should be just one null value, which is its own type. A given datum is either permitted to be null OR something else like a string. Or it isn't; it is expected to be a string, which is distinct from the null value; no string is a null value.

It's good to have a read notation for a timestamp, but it's not an elementary type; a timestamp is clearly an aggregate and should be understood as corresponding to some structure type. A timestamp should be expressible using that structure, not only as a special token.

This monstrosity is not exhibiting good typing; it is not good static typing, and not good dynamic typing either. Under static typing we can have some "maybe" type instead of null.string: in some representations we definitely have a string. In some other places we have a "maybe string", a derived type which gives us the possibility that a string is there, or isn't. Under dynamic typing, we can superimpose objects of different type in the same places; we don't need a null version of string since we can have "the" one and only null object there.

This looks like it was invented by people who live and breathe Java and do not know any other way of structuring data. Java uses statically typed references to dynamic objects, and each such reference type has a null in its domain so that "object not there" can be represented. But just because you're working on a reference implementation in such a language doesn't mean you cannot transcend the semantics of the implementation language. If you want to propose some broad interoperability standard, you practically must.
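
To make the "one null value vs. a family of typed nulls" point concrete, here is a small illustrative sketch (plain Python with type hints, not Ion; the record and field names are made up) of the "maybe" style the comment is advocating: the annotation carries the "may be absent" information, and absence is always the same single value:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class Event:
        name: str                    # always a string; there is no "null string"
        started: Optional[datetime]  # "maybe timestamp": either a datetime or the one None value

    e = Event(name="deploy", started=None)
    if e.started is None:
        print("no start time recorded")
    else:
        print("started at", e.started.isoformat())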

reply

wyc 15 hours ago

This reminds me a lot of Avro:

https://avro.apache.org/docs/current/

They both have self-describing schemas, support for binary values, JSON-interoperability, basic type systems (Ion seems to support a few more field types), field annotations, support for schema evolution, code generation not necessary, etc.

I think Avro has the additional advantages of being production-tested in many different companies, a fully-JSON schema, support for many languages, RPC baked into the spec, and solid performance numbers found across the web.

I can't really see why I'd prefer Ion. It looks like an excellent piece of software with plenty of tests, no doubt, but I think I could do without "clobs", "sexprs", and "symbols" at this level of representation, and it might actually be better if I do. Am I missing something?

reply

jcrites 14 hours ago

What do you mean by they both have self-describing schemas? In order to read or write Avro data, an application needs to possess a schema for that data -- the specific schema that the data was written with, and (when writing) the same schema that a later reader expects to find. This means the data is not self-describing.

Ion is designed to be self-describing, meaning that no schema is necessary to deserialize and interact with Ion structures. It's consequently possible to interact with Ion in a dynamic and reflective way, for example, in the same way that you can with JSON and XML. It's possible to write a pretty-printer for a binary Ion structure coming off the wire without having any idea of or schema for what's inside. Ion's advantage over those formats is that it's strongly typed (or richly typed, if you prefer). For example, Ion has types for timestamps, arbitrary-precision decimals like for currency, and can embed binary data directly (without base64 encoding), etc.

I wouldn't try to say that one or the other is better across the board. Rather, they have tradeoffs and relative strengths in different circumstances. Ion is in part designed to tackle scenarios like where your data might live a really long time, and needs to be comprehensible decades from now (whether you kept track of the schema or not, or remember which one it was); and needs to be comprehensible in a large distributed environment where not every application might possess the latest schema or where coordinating a single compile-time schema is a challenge (maybe each app only cares about some part of the data), and so on. Ion is well-suited to long-lived, document-type data that's stored at rest and interacted with in a variety of potentially complex ways over time. Data data. In the case of a simple RPC relationship between a single client and service, where the data being exchanged is ephemeral and won't stick around, and it's easy to definitively coordinate a schema across both applications, a typical serialization framework is a fine choice.
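
A sketch of what "no schema needed" can look like in practice. This assumes the ion-python bindings; the module path and the loads/dumps round-trip below are assumptions based on that project's README (only ion-java is linked in this thread), so treat it as hypothetical:

    # Hypothetical sketch: amazon.ion.simpleion and its binary flag are assumed, not verified here.
    import amazon.ion.simpleion as ion

    payload = ion.dumps({"id": 42, "when": "2016-04-18T12:00:00Z"}, binary=True)

    # A receiver with no schema at all can still recover the typed values,
    # because the type information travels with the data.
    print(ion.loads(payload))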

reply

wyc 14 hours ago

I think it depends on what level you're referring to. If you mean record-level, then I concede that it's not self-describing. However, looking at the suggested use cases, it seems that it's "self-describing" in that you'll always be able to decode data stored according to what the documentation recommends:

"Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file. Avro-based remote procedure call (RPC) systems must also guarantee that remote recipients of data have a copy of the schema used to write that data."

https://avro.apache.org/docs/current/spec.html#Data+Serializ...

reply

jcrites 13 hours ago

That's interesting. I didn't know that about Avro. Does the framework take responsibility for including the schema and defining a format consisting of schema plus data, or is that the responsibility of the application layer? It sounds like that might just be a convention or best practice recommended in the documentation, rather than a technical property of Avro itself.

If it's the application's responsibility to bundle the schema in Avro, then one difference is that Ion takes responsibility for embedding schema information along with each structure and field. Ion is also capable of representing data where there is no schema (analogy: a complex document like an HTML5 page), or working efficiently with large structures without deserializing everything even if the application needs data in just one field.

Another platform in contrast with Ion is Apache Parquet [1]. Parquet's support for columnar data means that it can serialize and compress table-like data extremely efficiently (it serializes all values in one column, followed by the next, until the end of a chunk -- enabling efficient compression as well as efficient column scans). Ion by comparison would serialize each row and field within it in a self-describing way (even though that information is redundant, in this particular case, since all rows are the same). Great flexibility and high fidelity at the expense of efficiency.

[1] https://parquet.apache.org/documentation/latest/

reply

aeroevan 4 hours ago

Avro files have a header which has metadata including the schema as well as things like compression codec (supports deflate and snappy), and all of the implementations that I have used (java and python bindings mostly) just do this in the background.

Another fun thing is that avro supports union types, so to make things nullable you just union[null, double] or whatever.

But one of the best things about avro (and parquet for that matter) is that it is well supported by the hadoop ecosystem
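
A minimal sketch of both points (the schema embedded in the container-file header, and union-based nullability), assuming the third-party fastavro package; the record and field names are made up for illustration:

    from fastavro import writer, reader  # assumes: pip install fastavro

    schema = {
        "type": "record",
        "name": "Measurement",
        "fields": [
            {"name": "sensor", "type": "string"},
            # Avro union type: the field is either null or a double.
            {"name": "value", "type": ["null", "double"], "default": None},
        ],
    }

    records = [{"sensor": "a1", "value": 3.5}, {"sensor": "a2", "value": None}]

    with open("measurements.avro", "wb") as out:
        writer(out, schema, records)   # the schema goes into the file header

    with open("measurements.avro", "rb") as fo:
        for rec in reader(fo):         # no schema argument needed to read it back
            print(rec)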

reply

andrioni 7 hours ago

In the spec[1] there is a definition of an "object container file" which includes the schema, and is the default format used whenever you save an Avro file. You can even use it whenever sending Avro data through the wire, if you don't mind paying the extra space cost.

[1]: http://avro.apache.org/docs/1.7.7/spec.html

reply

wyc 13 hours ago

I think libraries generally take care of stuffing the schema into the wire protocol, and I have a hunch you're right in that it's implementation-defined.

I like that in this regard, any individual record in Ion is standalone. I can think of a few ways that could come in handy, e.g., a data packet of nested mixed-version records. Did not know about Parquet, thanks!

reply

umanwizard 15 hours ago

Amazon invented Ion because yaml, Avro, etc. didn't exist at the time. Ion is actually pretty old.

The timing of open-sourcing it mystifies me a bit. Maybe Amazon is trying to become more open-source friendly, like Microsoft did?

Perhaps more likely: they're planning on making some internal APIs that use ION heavily public?

reply

deathanatos 9 hours ago

I can't decide if "JSON-superset" is technically accurate or not.

JSON's string literals come from JavaScript, and JavaScript only sort of has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code unit, not a code point. That means in JSON, the single code point U+1f4a9 "Pile of Poo" is encoded thusly:

    "\ud83d\udca9"

JSON specifically says this, too,

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A through
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".
   [… snip …]
   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

Now, Ion's spec says only:

   U+HHHH	\uHHHH	4-digit hexadecimal Unicode code point

But if we take it to mean code point, then if the value is a surrogate… what should happen?

Looking at the code, it looks like the above JSON will parse:

  1. Main parsing of \u here:
     https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434
  2. which is called from here, and just appended to a StringBuilder:
     https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975

My Java isn't that great though, so I'm speculating. But I'm not sure what should happen.

This is just one of those things that the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine.
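
For reference, this is how one mainstream JSON parser (Python's json module) handles the escaped surrogate pair; it's just an illustration of the behaviour described above, nothing Ion-specific:

    import json

    # The twelve-character UTF-16 surrogate-pair escape for U+1F4A9.
    s = json.loads(r'"\ud83d\udca9"')
    print(s, len(s))  # the single astral code point; length 1 in Python 3

    # Encoding goes the other way by default (ensure_ascii=True emits the pair again).
    print(json.dumps("\U0001F4A9"))
    print(json.dumps("\U0001F4A9", ensure_ascii=False))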

reply

jrgv 9 hours ago

> But if we take it to mean code point, then if the value is a surrogate… what should happen?

Surrogates are code points. The spec does not say what should happen if the surrogate is invalid (for example, if only the first surrogate of a surrogate pair is present), but neither does the JSON spec.

Java internally also represents non-BMP code points using surrogates. So, simply appending the surrogates to the string should yield a valid Java string if the surrogates in the input are valid.

reply

escherize 15 hours ago

Is there a source for benchmarks/reviews for the various ways to represent data? As far as I see it, there are a lot of them that I'd like to hear pros/cons for: json, edn + transit (my fave), yaml, google protobufs, thrift (?), as well as Ion.

And where does Ion fit here?

reply

nitrogen 14 hours ago

MessagePack is quite fast and the newest version has binary fields, but it lacks the rich datatypes like decimals and timestamps mentioned by another commenter. If Ion is as fast and has adequate language support, it sounds like it would be a good first choice for a new project.

Edit: There is a benchmark script that tests a few serializers and validators in Ruby in my [employer's] ClassyHash gem: https://github.com/deseretbook/classy_hash/. It would be easy to add more serializers to the benchmark: https://github.com/deseretbook/classy_hash/blob/master/bench...

reply

jcrites 14 hours ago

Ion's advantage is that it's both strongly-typed with a rich type system, as well as self-describing.

Data formats like JSON and XML can be somewhat self-describing, but they aren't always completely. Both tend to need to embed more complex data types as either strings with implied formats, or nested structures. (Consider: How would you represent a timestamp in JSON such that an application could unambiguously read it? An arbitrary-precision decimal? A byte array?) I'm not familiar with EDN, but it appears to be in a similar position as JSON in this regard. ProtocolBuffers, Thrift, and Avro require a schema to be defined in advance, and only work with schema-described data as serialization layers. Ion is designed to work with self-describing data that might be fairly complex, and have no compiled-ahead-of-time schema.

Ion makes it easy to pass data around with high fidelity even if intermediate systems through which the data passes understand only part of the data but not all of it. A classic weakness of traditional RPC systems is that, during an upgrade where an existing structure gains an additional field, that structure might pass through an application that doesn't know about the field yet. Thus when the structure gets deserialized and serialized again, the field is missing. The Ion structure by comparison can be passed from the wire to the application and back without that kind of loss. (Some serialization-based frameworks have solutions to this problem too.)

One downside is that its performance tends to be worse than schema-based serialization frameworks like Thrift/ProtoBuf/Avro where the payload is generally known in advance, and code can be generated that will read and deserialize it. Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.
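
To make the "strings with implied formats" point above concrete: in plain JSON the richer types have to ride along as strings under some out-of-band convention that both sides must already know. An illustrative Python sketch (the conventions here, ISO 8601 and decimal-as-string, are arbitrary choices):

    import json
    from datetime import datetime, timezone
    from decimal import Decimal

    # JSON has no timestamp or decimal type, so both get stringified by convention.
    doc = {
        "created": datetime(2016, 4, 18, 12, 0, tzinfo=timezone.utc).isoformat(),
        "price": str(Decimal("19.990")),  # string, to avoid losing precision to a float
    }
    wire = json.dumps(doc)

    # The reader just sees two strings; it has to already know which fields
    # are timestamps and which are decimals, and in which format.
    parsed = json.loads(wire)
    created = datetime.fromisoformat(parsed["created"])
    price = Decimal(parsed["price"])
    print(created, price)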

reply

brandonbloom 13 hours ago

EDN supports dates, etc, too.

However, it doesn't support blobs. I'm conflicted about this point. On one hand, small blobs can occasionally be useful to send within a larger payload. On the other hand, small blobs almost always become large blobs, and so I'd rather plan for out-of-band (preferably even content addressable) representations of blobs.

reply

zapov 9 hours ago

For JVM most popular benchmark is https://github.com/eishay/jvm-serializers/wiki

reply

eyan 12 hours ago

Surprised nobody mentioned CBOR (http://cbor.io) yet. Aka RFC 7049 (http://tools.ietf.org/html/rfc7049).

reply

LVB 12 hours ago

It is referenced in the Ion docs: http://amznlabs.github.io/ion-docs/index.html

reply

brianolson 1 hour ago

They complain about how CBOR is a superset of JSON data types and so some CBOR values (like bignum) might not down-convert to JSON cleanly, and then in the next paragraph they talk about how Ion is a superset of JSON data types including 'arbitrary sized integers'. Bad doubletalk. Boo. (I have implemented CBOR in a couple languages and like it. Every few months we get to say, "oh look, _another_ binary JSON.")

reply

eyan 3 hours ago

@LVB, thanks for that. RTFM-ing made me think twice about adopting CBOR or going with Ion. I'll also mention Velocypack (https://github.com/arangodb/velocypack) while here.

reply

Wasn't this solved already by the BSON specification - http://bsonspec.org ? Sure this allows you a definition of types, but this could easily be done using standard JSON meta data for each field. I find BSON simpler and more elegant.

reply

duskwuff 13 hours ago

BSON is awful.

reply

_wmd 12 hours ago

Most of this comes from BSON also being the internal storage format for a database server. For example, at least the redundant string NULs make it possible to use C library functions without copying, the unpacked ints allow direct dereferencing, etc.

I've no clue about the trailing NUL on the record itself, perhaps a safety feature?

reply

saosebastiao 13 hours ago

Do any of the popular message serialization formats have first class support for algebraic data types? It seems like every one I've researched has to be hacked in some way to provide for sum types.

reply

QuercusMax 13 hours ago

Protocol buffers support oneof, which is a union type. https://developers.google.com/protocol-buffers/docs/proto#on...

(Insert joke here about Google engineers just copying around protobufs.)

reply

koloron 11 hours ago

Nearly the same question was recently asked in r/haskell:

https://www.reddit.com/r/haskell/comments/4fhuw3/json_for_ad...

reply

kevinSuttle 13 hours ago

Would like to see a comparison to EDN. https://github.com/edn-format/edn

reply

userbinator 13 hours ago

Almost every time I see yet another structured data format I'm surprised at the number of people who haven't ever heard of ASN.1, despite it forming the basis of many protocols in widespread use.

reply

_wmd 12 hours ago

Usual ASN.1 caveat: parsing its specifications requires money and a lot of time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's work, and even the simpler ones thousands of eyes haven't managed to get right despite years of effort (see OpenSSL, NSPR, etc)

ASN.1 also has a million baroque types (VideotexString, anyone?) where most people just need "string", "small int", "big int", etc.

Some more on BER parsing hell here: https://mirage.io/blog/introducing-asn1

reply

userbinator 12 hours ago

Usual ASN.1 caveat: parsing its specifications requires money and a lot of time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's work

...unless you're Fabrice Bellard, who apparently wrote one just because it was one of the minor obstacles on the way to writing a full LTE base station:

http://www.bellard.org/ffasn1/

reply

---

https://en.wikipedia.org/wiki/Fluidics

---

.NET CLR Managed Runtime - https://github.com/dotnet/coreclr

.NET Framework - https://github.com/dotnet/corefx

.NET Compiler as a Service ("Roslyn") - https://github.com/dotnet/roslyn

.NET Orleans Actor Framework - https://github.com/dotnet/orleans

Mono Framework - https://github.com/mono/mono

Xamarin iOS, Watch, Mac Bindings and Framework - https://github.com/xamarin/xamarin-macios

Xamarin Android Bindings and Framework - https://github.com/xamarin/xamarin-android

---

"

optforfon 23 hours ago

The more I learn C the more I hate it. At first it seems simple and easy but reading "Expert C Programming" is reading a laundry list of what's really messed up with the language. 80% of the problems would be solved by some sane syntactic sugar that compiles down to C

reply

ams6110 23 hours ago

I would say generally I feel that way about any language I've learned. Initially they are pretty easy and the examples given include slick solutions to contrived problems. Then you get into wanting to do real work and you learn about all the corner cases and ambiguities and landmines that are hidden farther afield.

Can anyone name a language that they've grown to like more the more they learned about it? I would guess maybe only those in the LISP family would make the cut.

reply

groovy2shoes 8 minutes ago

> Can anyone name a language that they've grown to like more the more they learned about it? I would guess maybe only those in the LISP family would make the cut.

You're correct that a few of the Lisps have had this effect for me (Scheme, Common Lisp, EuLisp, Le-Lisp, elisp), but there have been a few dialects that I came to like less and less as I learned them (note that I do not necessarily dislike them, rather I'm disappointed by them): Clojure, Newlisp, and Racket, for example.

There are a few non-Lisps that I find myself liking more and more as I use them: Kitten and Mantra. They're both concatenative languages that take somewhat different approaches from the usual Forth-likes. I'm still not super proficient with them, though, so it's possible that the derivative of my fondness for them will invert yet. I've also had that experience with ksh (believe it or not) and Vim (a DSL for editing, if you will).

There have also been a few languages that have had a sort of "roller coaster" effect: at first they excited me greatly, then as I learned them better and better, I liked them more, then less, then more…. Some examples that come to mind are C#, Datalog, Haskell, Modula-3, OCaml, Prolog, Rust, and Standard ML.

reply

ArkyBeagle 23 hours ago

My favorite language remains 'C'. You don't have to even look at the corner cases, landmines and ambiguities - use the subset of the language that works.

Other than things like signed integer weirdness, most complaints about 'C' revolve around the library. Well, don't use those parts of the library.

reply

MichaelBurge 23 hours ago

Haskell can be annoying to work with in the beginning, since the slick quicksort examples don't really explain why you'd want to use it to write something real. But then your project grows bigger, and the type system maintains sanity in a way you don't otherwise see. Other languages usually use tons of unit tests everywhere to enable refactoring, but then they bog you down if you really need to change something.

I guess it has problems with records, space leaks, and deployment to old machines. Maybe ML or F# are just as good for this purpose; I haven't really used them.

reply

jsmith0295 20 hours ago

Objective-C, actually. I had really only used Java and a little C++ and PHP before I learned it to start doing iOS development (still iPhone OS at the time), and while I initially hated the syntax and found it confusing, I really ended up appreciating a lot about how things were done in the language. Although that was mostly things which originated in Smalltalk.

reply

steffan 13 hours ago

I have really enjoyed Scala. The more depth I gain in the language, the more I appreciate it. It has the ability to be compact without being overly terse, and is very readable if you don't get too crazy with using symbol overloading and overly-complicated types.

reply

RodericDay 23 hours ago

I'm not a super genius programmer by any means, but I've had a lot of pleasant surprises learning about the depths of Python.

reply "

---

Negative1 22 hours ago

This is exciting but I've been burned in the past. Your commit graph seems to show you're serious so I hope you stick with it.

Will probably have more questions after I browse through the code but some basics:

Q: How much of the collections library is covered with full support? Partial support? NYI w/ ETA?
Q: Cross-platform support? If I wanted to run this on iOS and Android today, what do I have to do? What about on Windows or OSX?
Q: How does GC work? Is there a way to take more control over GC for Scala objects?
Q: Is it possible to ‘link’ in external dependencies, i.e. Joda Time. How do I do this right now?
Q: Does the compiler use a translation layer? If so, is it consistent with Java 7 or 8?

On another note, the c extern stuff is excellent — exactly what I’d like to see for portability and performance (without the headache of something like the JNI).

reply

codecamper 21 hours ago

What are your mobile plans? :) I realize this is probably a ways off, but I've dreamed of being able to write core logic in a language like Scala & then using that code from ios & android.

Will scala native generate some c-linkable files?

reply

densh 18 hours ago

iOS/Android support is the most requested feature. Stay tuned for updates on that front.

Scala Native generates LLVM IR that can be compiled to C-linkable code but for now we focus on "one statically linked application at a time" use case.

reply

steeleduncan 20 hours ago

What are the plans with tail call optimisation in this version? or do you get that for free from LLVM?

reply

densh 18 hours ago

We get that for free from LLVM.

reply

airless_bar 19 hours ago

The slides mention that proper tail calls "just work", even mutual ones. :-)

reply

---