"To be a viable computer system, one must honor a huge list of large, and often changing, standards: TCP/IP, HTTP, HTML, XML, CORBA, Unicode, POSIX, NFS, SMB, MIME, POP, IMAP, X, ... A huge amount of work, but if you don’t honor the standards you’re marginalized. Estimate that 90-95% of the work in Plan 9 was directly or indirectly to honor externally imposed standards." -- (Rob Pike talk)

Plan 9 notes:

syscalls from :

bind open read dup fork seek stat remove fd2path getwd pipe exec chdir segattach exits wait sleep notify lock errstr

9P protocol messages from :

version auth error flush attach walk open create read write clunk remove stat wstat

version auth error flush clunk are protocol metastuff (flush aborts an outstanding transaction, clunk releases a fid, the 9P analogue of closing a file descriptor), leaving:

attach walk open create read write remove stat wstat

ddevault on Jan 28, 2018

The kernel is simple and sanely designed and interfacing with it is done through the filesystem in, again, a very simple and sane way. Your displays are configured by a simple <10 line shell script at boot time that writes plaintext to a device file, rather than a gargantuan graphics stack that's controlled with binary device files and ioctls. Filesystem namespaces are lovely, and make the filesystem cleanly organized and customized to the logged in user's needs and desires. I have a little script that clears my current terminal window: `echo -n "" > /dev/text`. The libc is small and focused (non-POSIX), and there are only a handful of syscalls. The process model is really sane and straightforward as well. Playing an mp3 file is `mp3dec < file.mp3 > /dev/audio`.

To open a TCP connection, you use the dial(3) function, which basically does the following: write "tcp!name!port" to /net/cs and read back the resolved address (/net/cs does the name resolution); then open /net/tcp/clone, which allocates a new connection, and read the connection ID from it; then write "connect" plus the resolved address to that control file; finally open /net/tcp/<id>/data, which is now a full-duplex TCP stream.
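The clone/connect dance can be sketched as a toy model (an in-memory stand-in for the real /net device files; the NetFS class and the pretend DNS answer are invented for illustration, not Plan 9 code):

```python
# Toy model of the Plan 9 /net dial sequence. On a real system these
# would be open()/read()/write() calls on /net/cs, /net/tcp/clone,
# and /net/tcp/<id>/ctl|data; here a fake object stands in for the kernel.

class NetFS:
    """Minimal fake of the /net file tree: cs resolution + tcp clone/ctl."""
    def __init__(self):
        self.next_conn = 0
        self.conns = {}  # id -> {"state": ..., "addr": ...}

    def resolve(self, dialstring):
        # write "tcp!name!port" to /net/cs, read back "<clonefile> <addr>"
        proto, name, port = dialstring.split("!")
        addr = f"10.0.0.1!{port}"  # pretend DNS answer (made up)
        return f"/net/{proto}/clone {addr}"

    def open_clone(self):
        # opening the clone file allocates a conversation; reading the
        # open file yields its number, which names /net/tcp/<id>/
        cid = self.next_conn
        self.next_conn += 1
        self.conns[cid] = {"state": "Closed", "addr": None}
        return cid

    def ctl_write(self, cid, msg):
        # writing "connect <addr>" to the ctl file establishes the connection
        verb, addr = msg.split(" ", 1)
        assert verb == "connect"
        self.conns[cid].update(state="Established", addr=addr)

def dial(net, dialstring):
    """Sketch of dial(): returns the conversation whose data file is the stream."""
    clonefile, addr = net.resolve(dialstring).split(" ", 1)
    cid = net.open_clone()
    net.ctl_write(cid, f"connect {addr}")
    return cid
```

Everything is plain reads and writes of text to files, which is exactly the point of the comment above.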

There's this emphasis on simple, sane ways of fulfilling tasks on plan9 that permeates the whole system. It's beautiful.

" Fuchsia: a new operating system

Nur Hussein

Fuchsia is a new operating system being built more or less from scratch at Google. ... At the heart of Fuchsia is the Magenta microkernel... LK, the kernel that Magenta builds upon, was created by Fuchsia developer Travis Geiselbrecht before he joined Google. LK's goal is to be a small kernel that runs on resource-constrained tiny embedded systems (in the same vein as FreeRTOS or ThreadX). Magenta, on the other hand, targets more sophisticated hardware (a 64-bit CPU with a memory-management unit is required to run it), and thus expands upon LK's limited features. Magenta uses LK's "inner constructs", which comprise threads, mutexes, timers, events (signals), wait queues, semaphores, and a virtual memory manager (VMM). For Magenta, LK's VMM has been substantially improved upon.

One of the key design features of Magenta is the use of capabilities....Capabilities are implemented in Magenta by the use of constructs called handles....Almost all system calls require that a handle be passed to them. Handles have rights associated with them... The rights that can be granted to a handle are for reading or writing to the associated kernel object or, in the case of a virtual memory object, whether or not it can be mapped as executable....Since memory is treated as a resource that is accessed via kernel objects, processes gain use of memory via handles. Creating a process in Fuchsia means a creator process (such as a shell) must do the work of creating virtual memory objects manually for the child process. This is different from traditional Unix-like kernels such as Linux, where the kernel does the bulk of the virtual memory setup for processes automatically. Magenta's virtual memory objects can map memory in any number of ways, and a lot of flexibility is given to processes to do so. One can even imagine a scenario where memory isn't mapped at all, but can still be read or written to via its handle like a file descriptor. While this setup allows for all kinds of creative uses, it also means that a lot of the scaffolding work for processes to run must be done by the user-space environment.
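The handle/rights idea can be sketched as a toy model (invented names and rights bits, not Magenta/Zircon's actual syscall API): every operation takes a handle, and the kernel checks the rights attached to that handle rather than the identity of the caller.

```python
# Toy capability-style handles: rights live on the handle, not the process.
READ, WRITE, EXEC_MAP = 1, 2, 4  # hypothetical rights bits

class KernelObject:
    def __init__(self, data=b""):
        self.data = bytearray(data)

class Handle:
    def __init__(self, obj, rights):
        self.obj = obj
        self.rights = rights

    def duplicate(self, rights):
        # a duplicate may only narrow rights, never widen them
        if rights & ~self.rights:
            raise PermissionError("cannot add rights on duplicate")
        return Handle(self.obj, rights)

def sys_read(handle):
    if not handle.rights & READ:
        raise PermissionError("handle lacks READ")
    return bytes(handle.obj.data)

def sys_write(handle, data):
    if not handle.rights & WRITE:
        raise PermissionError("handle lacks WRITE")
    handle.obj.data[:] = data
```

Handing a narrowed duplicate to another process is how delegation works in such a system: the receiver gets exactly the rights encoded in the handle, nothing ambient.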

Since Magenta was designed as a microkernel, most of the operating system's major functional components also run as user-space processes. These include the drivers, network stack, and filesystems. The network stack was originally bootstrapped from lwIP, but eventually it was replaced by a custom network stack written by the Fuchsia team. The network stack is an application that sits between the user-space network drivers and the application that requests network services. A BSD socket API is provided by the network stack.

The default Fuchsia filesystem, called minfs, was also built from scratch... since the filesystems run as user-space servers, accessing them is done via a protocol to those servers....The user-space C libraries make the protocol transparent to user programs, which will just make calls to open, close, read, and write files. ... Full POSIX compatibility is not a goal for the Fuchsia project; enough POSIX compatibility is provided via the C library, which is a port of the musl project to Fuchsia..."

Khaine 13 hours ago

One thing I don't see addressed in the README is why? Why do we need Fuchsia? What problem are we trying to solve? Why should I use/develop for it instead of Windows/Linux/macOS?

Or is this just a research operating system designed to test new ideas out?


akavel 12 hours ago

The main keywords are "capability based" and "microkernel". Those ideas bring numerous advantages over monolithic kernels (including Linux, Windows, macOS), especially a huge boost to protection against vulnerabilities, plus better reliability and modularity. They are quite well researched already AFAIU, and apparently the time has come for them to start breaking through to the "mainstream" (besides Fuchsia, see e.g. ...).

Other than that, for Google this would obviously bring total control over the codebase, allowing them to do whatever they want, and super quickly, not needing to convince Linus or anybody else.


naasking 21 hours ago

Some problems I see from skimming the docs:

> Calls which have no limitations, of which there are only a very few, for example zx_clock_get() and zx_nanosleep() may be called by any thread.

Having the clock be an ambient authority leaves the system open to easy timing attacks via implicit covert channels. I'm glad these kinds of timing attacks have gotten more attention with Spectre and Meltdown. Capability security folks have been pointing these out for decades.

> Calls which create new Objects but do not take a Handle, such as zx_event_create() and zx_channel_create(). Access to these (and limitations upon them) is controlled by the Job in which the calling Process is contained.

I'm hesitant to endorse any system calls with ambient authority, even if it's scoped by context like these. It's far too easy to introduce subtle vulnerabilities. For instance, these calls seem to permit a Confused Deputy attack as long as two processes are running in the same Job.
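The worry above is the classic confused-deputy pattern. As a hedged toy illustration (generic capability-security folklore, not Fuchsia's actual API; all names here are invented): with ambient authority a service acts with *its own* rights on a name supplied by the client, so the client can reach objects it could never open itself, whereas passing a capability makes the client's authority explicit.

```python
# Classic confused-deputy toy: ambient authority vs capability passing.
files = {"/log/billing": "", "/home/alice/notes": "private"}
acl = {
    "deputy": {"/log/billing", "/home/alice/notes"},  # deputy is privileged
    "alice": {"/home/alice/notes"},
    "bob": set(),                                      # bob has no file rights
}

def write_ambient(principal, path, data):
    # ambient-authority "kernel": checks the *caller's* rights on the path
    if path not in acl[principal]:
        raise PermissionError(f"{principal} may not write {path}")
    files[path] = data

def deputy_log(client_supplied_path, entry):
    # the deputy logs on behalf of clients; the path comes from the client,
    # but the authority used is the deputy's own -> confused deputy
    write_ambient("deputy", client_supplied_path, entry)

# Capability version: the client must hand over an object it could create.
class FileCap:
    def __init__(self, owner, path):
        if path not in acl[owner]:
            raise PermissionError("owner lacks rights to mint this cap")
        self.path = path

    def write(self, data):
        files[self.path] = data

def deputy_log_cap(cap, entry):
    cap.write(entry)  # the deputy can only touch what the client could
```

In the ambient version, bob can point the deputy at alice's file; in the capability version, bob cannot even construct the capability he would need to pass.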

Other notes on the kernel:

Looks like they'll also support private namespacing ala Plan 9, which is great. I hope we can get a robust OS to replace existing antiquated systems with Google's resources. This looks like a good start.


overview from the Lisp Machine manual



Associated VMs


" In this spirit, the L4 microkernel provides few basic mechanisms: address spaces (abstracting page tables and providing memory protection), threads and scheduling (abstracting execution and providing temporal protection), and inter-process communication (for controlled communication across isolation boundaries).

An operating system based on a microkernel like L4 provides services as servers in user space that monolithic kernels like Linux or older generation microkernels include internally. For example, in order to implement a secure Unix-like system, servers must provide the rights management that Mach included inside the kernel. " [1]

" Microkernels minimize the functionality that is provided by the kernel: The kernel provides a set of general mechanisms, while user-mode servers implement the actual operating system (OS) services [Brinch Hansen 1970; Levin et al. 1975]. Application code obtains a system service by communicating with servers via an interprocess com- munication (IPC) mechanism, typically message passing. Hence, IPC is on the critical path of any service invocation, and low IPC costs are essential.

By the early 1990s, IPC performance had become the Achilles heel of microkernels: The typical cost for a one-way message was around 100us, which was too high for building performant systems. This resulted in a trend to move core services back into the kernel [Condict et al. 1994]. ... Twenty years ago, Liedtke [1993a] demonstrated with his L4 kernel that microkernel IPC could be fast, a factor 10–20 faster than contemporary microkernels. " [2]

" The asynchronous in-kernel-buffering process communication concept used in Mach turned out to be one of the main reasons for its poor performance. ... Detailed analysis of the Mach bottleneck indicated that, among other things, its working set is too large: the IPC code expresses poor spatial locality; that is, it results in too many cache misses, of which most are in-kernel.[3] This analysis gave rise to the principle that an efficient microkernel should be small enough that the majority of performance-critical code fits into the (first-level) cache (preferably a small fraction of said cache).


Instead of Mach's complex IPC system, (Jochen Liedtke's) L3 microkernel simply passed the message without any additional overhead.


After some experience using L3, Liedtke came to the conclusion that several other Mach concepts were also misplaced. By simplifying the microkernel concepts even further he developed the first L4 kernel which was primarily designed with high performance in mind. In order to wring out every bit of performance the entire kernel was written in assembly language, and its IPC was 20 times faster than Mach's. ... "

"...all of them have a scheduler in the kernel, which implements a particular scheduling policy (usually hard-priority round robin)." [3]


"We mentioned earlier the importance of IPC performance and that the design and im- plementation of L4 kernels consistently aimed at maximising it. However, the details have evolved considerably.

3.2.1. Synchronous IPC. The original L4 supported synchronous (rendezvous-style) IPC (((the sender blocks))) as the only...mechanism. Synchronous IPC avoids buffering in the kernel and the management and copying cost associated with it. In fact, in its simplest version (short messages passed in registers) it is nothing but a context switch that leaves the message registers untouched...typical L4 implementations have IPC costs that are only 10% to 20% above the hardware limit (defined as the cost of two mode switches, a switch of page tables, plus saving and restoring addressing context and user-visible processor state)...

While certainly minimal, and simple conceptually and in implementation, experience taught us significant drawbacks of this model: It forces a multithreaded design onto otherwise simple systems, with the resulting synchronisation complexities. For example, the lack of functionality similar to UNIX select() required separate threads per interrupt source, and a single-threaded server could not wait for client requests and interrupts at the same time.

Furthermore, synchronous message passing is clearly the wrong way of synchronising activities across processor cores. On a single processor, communication between threads requires that (eventually) a context switch happens, and combining the context switch with communication minimises overheads. Consequently, the classical L4 IPC model is that of a user-controlled context switch that bypasses the scheduler; some payload is delivered through nonswitched registers, and further optional payload by kernel copy. On hardware that supports true parallelism, an RPC-like server invocation sequentialises client and server, which should be avoided if they are running on separate cores... " [4]
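The rendezvous model described above can be captured in a small state-machine sketch (a toy, with made-up names: no kernel buffering, a send blocks until the matching receive, and delivery is just handing the "message registers" across the context switch):

```python
# Toy rendezvous (synchronous) IPC: no in-kernel message buffers.
BLOCKED_SEND, BLOCKED_RECV, RUNNABLE = "blocked_send", "blocked_recv", "runnable"

class Thread:
    def __init__(self, name):
        self.name = name
        self.state = RUNNABLE
        self.mrs = None      # "message registers"
        self.partner = None  # who this thread is blocked on

def ipc_send(sender, receiver, msg_regs):
    if receiver.state == BLOCKED_RECV and receiver.partner is sender:
        # rendezvous: hand over the registers; both become runnable;
        # nothing was buffered in the kernel at any point
        receiver.mrs = msg_regs
        receiver.state = sender.state = RUNNABLE
        receiver.partner = None
        return "delivered"
    sender.state, sender.partner, sender.mrs = BLOCKED_SEND, receiver, msg_regs
    return "blocked"

def ipc_recv(receiver, sender):
    if sender.state == BLOCKED_SEND and sender.partner is receiver:
        receiver.mrs = sender.mrs
        sender.state = receiver.state = RUNNABLE
        sender.partner = None
        return "delivered"
    receiver.state, receiver.partner = BLOCKED_RECV, sender
    return "blocked"
```

Whichever side arrives first blocks; the second arrival completes the transfer, which is why, in the common short-message case, the whole operation is essentially a context switch that leaves the registers untouched.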


" We addressed this in L4-embedded by adding notifications , a simple, nonblocking signalling mechanism. We later refined this model in seL4’s notification objects

A notification contains a set of flags, the notification word, which is essentially an array of binary semaphores. A signal operation on a notification object sets a subset of the flags without blocking. The notification word can be checked by polling or by waiting (blocking) for a signal—effectively select() across the notification word.

Our present design provides another feature aimed at reducing the need for multithreaded code, unifying waiting for IPC and notifications. For example, a file server might have an IPC interface for client requests, as well as a notification used by the disk driver to signal I/O completion. The unification feature binds a notification object to a thread (the server of the above example). If a notification is signalled while the thread is waiting for a message, the notification is converted into a single-word message and delivered to the thread (with an indication that it is really a notification). Notifications are not an introduction of asynchronous IPC through the backdoor but rather a (partial) decoupling of synchronisation from communication. While strictly not minimal (in that they add no functionality that could not be emulated with other mechanisms), they are essential for exploiting concurrency of the hardware.

In summary, like most other L4 kernels, seL4 retains the model of synchronous IPC but augments it with semaphore-like notifications. OKL4 has completely abandoned synchronous IPC and replaced it with virtual IRQs (similar to notifications). NOVA has augmented synchronous IPC with counting semaphores [Steinberg and Kauer 2010], while Fiasco.OC has also augmented synchronous IPC with virtual IRQs. " [5]
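The notification-word mechanism above is small enough to sketch directly (a toy model with invented names; the real seL4 object also handles badges, queuing, and thread binding in the kernel):

```python
# Toy seL4-style notification object: the notification word is an array of
# binary semaphores packed into one machine word. signal() ORs in bits
# without blocking; poll() (or a bound waiter) collects-and-clears the word.

class Waiter:
    """Stand-in for a thread blocked waiting on the notification."""
    def __init__(self):
        self.got = None
    def deliver(self, bits):
        self.got = bits

class Notification:
    def __init__(self):
        self.word = 0     # the notification word
        self.waiter = None  # at most one bound/waiting thread

    def signal(self, bits):
        # non-blocking: set flags; if someone is waiting, deliver immediately
        self.word |= bits
        if self.waiter is not None:
            w, self.waiter = self.waiter, None
            w.deliver(self.word)
            self.word = 0

    def poll(self):
        # non-blocking check: return and clear all pending flags
        bits, self.word = self.word, 0
        return bits
```

Multiple signals between checks simply merge into the word (like select() reporting several ready sources at once), which is why signalling can never block or allocate.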

virtual registers:

early L4 versions passed messages in CPU registers. But "Pistachio introduced the concept of virtual message registers (originally 64 and later a configuration option). The implementation mapped some of them to physical registers, and the rest was contained in a per-thread pinned part of the address space. The pinning ensures register-like access without the possibility of a page fault. Inlined access functions hide the distinction between physical and memory-backed registers from the user. seL4 and Fiasco.OC continue to use this approach. The motivation is two-fold: virtual message registers greatly improve portability across architectures. Furthermore, they reduce the performance penalty for moderately sized messages exceeding the number of physical registers " [6]
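A sketch of the virtual-register split (register counts and names are made up; the real kernels generate the accessors and pin the memory area in the UTCB):

```python
# Toy virtual message registers: the first few MRs live in real CPU
# registers, the rest in a pinned per-thread memory area; accessor
# functions hide which is which from the user.

N_PHYS = 4    # MRs mapped to physical registers (hypothetical count)
N_VIRT = 64   # total virtual MRs (Pistachio's original number)

class ThreadMRs:
    def __init__(self):
        self.phys = [0] * N_PHYS                # stand-in for CPU registers
        self.pinned = [0] * (N_VIRT - N_PHYS)   # pinned area: no page faults

    def set_mr(self, i, val):
        if i < N_PHYS:
            self.phys[i] = val
        else:
            self.pinned[i - N_PHYS] = val

    def get_mr(self, i):
        return self.phys[i] if i < N_PHYS else self.pinned[i - N_PHYS]
```

Code written against set_mr/get_mr is portable across architectures with different register counts: retargeting just changes N_PHYS and which accessors inline to register moves.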

long messages (dropped):

" In original L4, “long” messages could specify multiple buffers in a single IPC invo- cation to amortise the hardware mode- and context-switch costs...required the kernel to handle nested practice it was rarely used: Shared buffers can avoid any explicit copying between address spaces and are generally preferred for bulk data transfer..." [7]. The need for nested exception handling (page fault exceptions, actually) in the kernel would have made verification of seL4 much harder.

IPC Destinations (dropped):

" Original L4 had threads as the targets of IPC operations ... Influenced by EROS [Shapiro et al. 1999], seL4 and Fiasco.OC [Lackorzynski and Warg 2009]) adopted IPC endpoints as IPC destinations. seL4 endpoints are essentially ports: The root of the queue of pending senders or receivers is a now a separate kernel object, instead of being part of the recipient’s thread control block (TCB). Unlike Mach ports [Accetta et al. 1986], IPC endpoints do not provide any buffering. ... In order to help servers identify clients without requiring per-client endpoints, seL4 provides badged capabilities , similar to the distinguished capabilities of KeyKOS? [Bromberger et al. 1992]. Capabilities with different badges but derived from the same original capability refer to the same (endpoint) object but on invokation deliver to the receiver the badge as an identification of the sender. "

IPC timeouts (dropped):

" A blocking IPC mechanism creates opportunities for denial-of- service (DOS) attacks. For example, a malicious (or buggy) client could send a request to a server without ever attempting to collect the reply; owing to the rendezvous-style IPC, the sender would block indefinitely unless it implements a watchdog to abort and restart...

To protect against such attacks, IPC operations in the original L4 had timeouts. Specifically, an IPC syscall specified four timeouts: one to limit blocking until start of the send phase, one to limit blocking in the receive phase, and two more to limit blocking on page faults during the send and receive phases (of long IPC).

Timeout values were encoded in a floating-point format that supported the values of zero, infinity, and finite values ranging from one millisecond to weeks. They added complexity for managing wakeup lists.

Practically, however, timeouts were of little use as a DOS defence. There is no theory, or even good heuristics, for choosing timeout values in a nontrivial system, and in practice, only the values zero and infinity were used: A client sends and receives with infinite timeouts, while a server waits for a request with an infinite timeout but replies with a zero timeout. 7 (7: The client uses an RPC-style call operation, consisting of a send followed by an atomic switch to a receive phase, guaranteeing that the client is ready to receive the server’s reply)

Traditional watchdog timers represent a better approach to detecting unresponsive IPC interactions (e.g., resulting from deadlocks).

Having abandoned long IPC, in L4-embedded we replaced timeouts by a single flag supporting a choice of polling (zero timeout) or blocking (infinite timeout). Only two flags are needed, one for the send and one for the receive phase. seL4 follows this model. A fully asynchronous model, such as that of OKL4, is incompatible with timeouts and has no DOS issues that would require them.

Timeouts could also be used for timed sleeps by waiting on a message from a nonexisting thread, a feature useful in real-time systems. Dresden experimented with extensions, including absolute timeouts, which expire at a particular wall clock time rather than relative to the commencement of the system call. Our approach is to give userland access to a (physical or virtual) timer "

L4 implementations

"He unveiled the 12-Kbyte fast L4 microkernel after 1996 while working for IBM in New York City. ... The L4 developer community is very active and a modern implementation of L4 in C++ is the L4Ka::Pistachio microkernel " [8] (2005)

"Karlsruhe’s experience with Version X and Hazelnut resulted in a major ABI revision, V4, aimed at improving kernel and application portability and multiprocessor support and addressing various other shortcomings. After Liedtke’s tragic death in 2001, his students implemented the design in a new open-source kernel, L4Ka::Pistachio (“Pistachio” for short)." [9]

"At NICTA we then retargeted Pistachio for use in resource-constrained embedded systems, resulting in a fork called NICTA::Pistachio-embedded (“L4-embedded”). It saw massive-scale commercial deployment when Qualcomm adopted it as a protected-mode real-time OS for the firmware of their wireless modem processors. It is now running on the security processor of all recent Apple iOS devices" [10]

" The influence of KeyKOS? [Hardy 1985] and EROS [Shapiro et al. 1999] and an increased focus on security resulted in the adoption of capabilities [Dennis and Van Horn 1966] for access control, first with the 2.1 release of OKL4 (2008) and soon followed by Fiasco; Fiasco was renamed Fiasco.OC in reference to its use of object capabilities. Aiming for formal verification, which seemed infeasible for a code base not designed for the purpose, we instead opted for a from-scratch implementation for our capability-based seL4 kernel. " [11]

"seL4’s SLOC count is somewhat bloated as a consequence of the C code being mostly a “blind” manual translation from Haskell [Klein et al. 2009], together with generated bit-field accessor functions, resulting in hundreds of small functions. The kernel compiles into about 9 k ARM instructions" [12]

" Later the UNSW group, at their new home at NICTA, forked L4Ka::Pistachio into a new L4 version called NICTA::L4-embedded. As the name implies, this was aimed at use in commercial embedded systems, and consequently the implementation trade-offs favored small memory footprints and aimed to reduce complexity. The API was modified to keep almost all system calls short enough that they do not need preemption points to ensure high real-time responsiveness.[10] " [13]

" On 29 July 2014, NICTA and General Dynamics C4 Systems announced that seL4, with end to end proofs, was now released under open source licenses.[20] The kernel source and proofs are under GPLv2, and most libraries and tools are under the 2-clause BSD license. " [14]

" F9 microkernel, a BSD-licensed L4 implementation, is built from scratch for deeply embedded devices with performance on ARM Cortex-M3/M4 processors, power consumption, and memory protection in mind. " [15]

F9 has about 3k LOC [16]

"the most general principles behind L4, minimality, including running device drivers at user level, generality, and a strong focus on performance, still remain relevant and foremost in the minds of developers. Specifically we find that the key microkernel per- formance metric, IPC latency, has remained essentially unchanged, in terms of clock cycles, as far as comparisons across vastly different ISAs and micro architectures have any validity. This is in stark contrast to the trend identified by Ousterhout [1990] just a few years before L4 was created. Furthermore, and maybe most surprisingly, the code size has essentially remained constant, a rather unusual development in software sys- tems" [17]

[18] section 2.2 implies that the most relevant current implementations are seL4 and Fiasco.OC, and NOVA (and a proprietary one, but that's not useful for me).

so it sounds like maybe the ones for me to look at are: seL4, L4Ka::Pistachio, F9, and Fiasco.OC

L4 seL4

"We have throughout commented on the influence verification had on seL4’s design and implementation. Specifically, verification prohibited or discouraged — a weak and unprincipled resource management model (Section 3.4.3) — configurable variants, such as pluggable schedulers (Section 3.4.4) — nested exceptions (Section 4.1) — nonstandard calling conventions (Section 4.6) — assembly code (Section 4.7) — concurrency and locking (Section 5.2). ... verification imposed some coding discipline on the implementors. Al- most all of this is in line with good software-engineering practice and thus is not really restrictive. The main exception is the prohibition on passing references to automatic variables to functions [Klein et al. 2009]. This is not an inherent requirement of ver- ification, but a tradeoff between verification and implementation effort. It is the sole drawback of verification we observed to date and could be removed by more investment into the verification infrastructure. ... Several design deci- sions, such as the simplified message structure, user-level control of kernel memory, and the approach to multicores are strongly influenced by verification. It also impacted a number of implementation approaches, such as the use of an event-oriented kernel, adoption of standard calling conventions, and the choice of C as the implementation language ... The first L4 kernel written completely in a high-level language was Fiasco. The developers chose C++ rather than C...The Karlsruhe team also chose C++ for Pistachio, mostly to support portability...for seL4, the requirements of verification forced the choice of C. " [19]

" In seL4, the architecture-agnostic code (between x86 and ARM) only accounts for about 50%. About half the code deals with virtual memory management, which is necessarily architecture-specific. The lower fraction of portable code is a result of seL4’s overall smaller size, with most (architecture-agnostic) resource-management code moved to userland. There is little architecture-specific optimisation except for the IPC fastpath. Steinberg [2013] similarly estimates a 50% rewri " [20]

"the event-based kernel no longer requires saving and restoring the C calling convention on a stack switch ... It is therefore essential to avoid long-running kernel operations and to use preemption points where this is not pos- sible, specifically the practically unbounded object deletion. We put significant effort into placement of preemption points, as well as on data structures and algorithms that minimises the need for them [Blackham et al. 2012]. We note that a continuation-based ... event kernel, such as seL4, provides natural support for preemption points by making them continuation points. " [21]

L4 lazy scheduling

" In the rendezvous model of IPC, a thread’s state frequently alternates between runnable and blocked. This implies frequent queue manipulations, moving a thread into and out of the ready queue, often many times within a time slice. Liedtke’s lazy scheduling trick minimises these queue manipulations: When a thread blocks on an IPC operation, the kernel updates its state in the TCB but leaves the thread in the ready queue, with the expectation it will unblock soon. When the scheduler is invoked upon a time-slice preemption, it traverses the ready queue un- til it finds a thread that is really runnable and removes the ones that are not. The approach was complemented by lazy updates of wakeup queues. Lazy scheduling moves work from the high-frequency IPC operation to the less fre- quently invoked scheduler. We observed the drawback when analysing seL4’s worst- case execution time (WCET) for enabling its use in hard real-time systems [Blackham et al. 2012]: The execution time of the scheduler is only bounded by the number of threads in the system.

To address the issue, we adopted an alternative optimisation, referred to as Benno scheduling, which does not suffer from pathological timing behaviour: Instead of leaving blocked threads in the ready queue, we defer entering unblocked threads into the ready queue until preemption time. This changes the main scheduler invariant from “all runnable threads are in the ready queue” to “the set of runnable threads consists of the currently executing thread plus the content of the ready queue.” Benno scheduling retains the desirable property that the ready queue usually does not get modified when threads block or unblock during IPC. At preemption time, the kernel inserts the (still runnable but no longer executing) preempted thread into the ready queue. This constant-time operation is the only fix-up needed. In addition, the removal of timeouts means that there are no more wakeup queues to manipulate. Endpoint wait queues must be strictly maintained, but in the common case of a server responding to client requests received via a single endpoint, they are hot in the cache, so the cost of those queue manipulations is low. This approach has similar average-case performance as lazy scheduling, while also having a bounded and small WCET. " [22]
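The Benno invariant is compact enough to model directly (a toy with invented names, tracking only states and queues, no real dispatching): the runnable set is the current thread plus the ready queue, and only preemption touches the queue.

```python
# Toy model of Benno scheduling: blocking/unblocking during IPC leaves
# the ready queue alone; only a preemption does a (constant-time) insert.

class Thread:
    def __init__(self, name):
        self.name = name
        self.runnable = True

class BennoScheduler:
    def __init__(self, current):
        self.current = current
        self.ready = []  # invariant: runnable = {current} ∪ ready

    def block_current_on_ipc(self, next_thread):
        # common IPC case: current blocks, partner runs; queue untouched
        self.current.runnable = False
        next_thread.runnable = True
        self.current = next_thread

    def preempt(self, next_thread):
        # timer tick: the still-runnable preempted thread enters the
        # ready queue -- the only queue manipulation, O(1)
        self.ready.append(self.current)
        self.current = next_thread

    def runnable_threads(self):
        return [self.current] + self.ready
```

Contrast with lazy scheduling, where blocked threads linger in the ready queue and the scheduler's cleanup pass is bounded only by the number of threads, which is what broke the WCET analysis.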

L4 process vs. event-based kernel

"The original L4 had a separate kernel stack for each thread...The many kernel stacks dominate the per-thread mem- ory overhead, and they also increase the kernel’s cache footprint...The kernel’s memory use became a significant issue when L4 was gaining traction in the embedded space, so the design needed revisiting...required handling nested exceptions, that is, page faults triggered while in the kernel. It thus adds significant kernel complexity and massive challenges to formal verification...

an event-based (single-stack) kernel with continuations...showed a reduction in kernel memory consumption and improvements in IPC performance...the event kernel’s per-thread memory use was a quarter of that of the process kernel, despite the event kernel requiring more than twice the TCB size of the process kernel for storing the continuations....Haeberlen 2003. Warton 2005 " [23]

L4 user-level device drivers

L4 has user-level device drivers, except for "A small number of drivers are still best kept in the kernel. In a modern L4 kernel this typically means a timer driver, used for preempting user processes at the end of their time slice, and a driver for the interrupt controller, which is required to safely distribute interrupts to user-level processes." [24]

" The user-level driver model is tightly coupled with modelling interrupts as IPC messages, which the kernel sends to the driver. Details of the model (IPC from a vir- tual thread vs upcall), as well as the association and acknowledgment protocol, have changed over the years (changed back and back again), but the general approach still applies. The most notable change was moving from IPC to notifications as the mechanism for interrupt delivery. The main driver here was implementation simplification, as delivery as messages required the emulation of virtual in-kernel threads as the sources of interrupt IPC, while signalling notifications is a natural match to what the hardware does. " [25]

L4 lessons

Concluding Table IV. Summary of What Did and Did Not Last the Distance from L4 Microkernels: The Lessons from 20 Years of Research and Deployment:

kept principles:

new ideas:

stuff that was removed (excepting one item with agrees=no):

implementation stuff that was removed (excepting one item with agrees=no):

elsewhere in the paper they also mention uncontroversial design decisions and implementation tricks, "such as the send-receive combinations in a single system call: the client-style call for an RPC-like invocation and the server-style reply-and-wait"

L4 misc

ISAs for L4

" A common thread throughout those two decades is the minimality principle , intro- duced in Section 3.1, and a strong focus on the performance of the critical IPC oper- ation: kernel implementors generally aim to stay close to the limits set by the micro architecture, as shown in Table I. Consequently, the L4 community tends to measure IPC latencies in cycles rather than microseconds, as this better relates to the hardware limits. In fact, the table provides an interesting view of the context-switch-friendliness of the hardware: compare the cycle counts for Pentium 4 and Itanium, both from highly optimised IPC implementations on contemporary architectures.

Table I. One-Way Cross-Address-Space IPC Cost of Various L4 Kernels

Name       Year  Architecture  Processor     MHz    Cycles  μs
Original   1993  486           DX50             50     250  5.00
Original   1997  x86 (32-bit)  Pentium         160     121  0.75
L4/MIPS    1997  MIPS64        R4700           100     100  1.00
L4/Alpha   1997  Alpha         21064           433   70–80  0.17
Hazelnut   2002  x86 (32-bit)  Pentium II      400     273  0.68
Hazelnut   2002  x86 (32-bit)  Pentium 4     1,400   2,000  1.38
Pistachio  2005  IA-64         Itanium 2     1,500      36  0.02
OKL4       2007  ARMv5         XScale 255      400     151  0.64
NOVA       2010  x86 (32-bit)  Bloomfield    2,660     288  0.11
seL4       2013  x86 (32-bit)  Haswell       3,400     301  0.09
seL4       2013  ARMv6         ARM11           532     188  0.35
seL4       2013  ARMv7         Cortex A9     1,000     316  0.32

... Table I shows the enormous influence of (micro-)architecture on context-switching and thus IPC costs. In terms of cycles, these kept increasing on x86, culminating in the 2,000-cycle IPC on the Pentium 4 (NetBurst) architecture around the turn of the century. This created real challenges for microkernels: Our experience is that context switches that cost a few hundred cycles rarely ever matter, while thousands of cycles become a system performance issue, for example, with user-level drivers for high-bandwidth network interfaces. Fortunately, with the increased focus on power consumption, the architectural trend reversed: the latest-generation x86 processors, beginning with the Haswell microarchitecture, enable context-switch times at par with RISC processors. " [26]

L4 Recursive page mappings

Some but not all variants of L4 provide recursive page mappings:

" Having a valid mapping of a page in its address space, a process had the right to map the page into another address space. Instead of mapping, a process could grant one of its pages, which removed the page, and any authority over it, from the grantor. A mapping, but not a grant, could be revoked by an unmap operation. " [27]


" 25% to 50% of kernel memory was consumed by the mapping database even without malicious processes ... allows two colluding processes to force the kernel to consume large amounts of memory by recursively mapping the same frame to different pages in each other’s address space, a potential DOS attack especially on 64-bit hardware, which can only prevented by controlling IPC (via the dreaded clans- and-chiefs, see Section 3.2.5). ... We replaced it by a model that more closely mirrors hardware, where mappings always originate from ranges of physical memory frames. ... Multiple approaches: Some L4 kernels retain the model of recursive address-space construction, while seL4 and OKL4 originate mappings from frames " [28]

L4 Kernel memory security against DOS

" While capabilities provide a clean and elegant model for delega tion, by themselves they do not solve the problem of resource management. A single malicious thread with grant right on a mapping can still use this to create a large number of mappings, forcing the kernel to consume large amounts of memory for metadata, and potentially DOS-ing the system ... Kernels that manage memory as a cache of user-level content only partially address this problem. While caching-based approaches remove the opportunity for DOS attacks based on memory exhaustion, they do not enable the strict isolation of kernel memory that is a prerequisite for performance isolation or real-time systems and likely intro duce side channels.

Liedtke et al. [1997b] examined this issue and proposed per-process kernel heaps... Per-process kernel heaps simplify user level, by removing control of allocation, at the expense of foregoing the ability to revoke allocations without destroying the process, and the ability to reason directly about allocated memory, as opposed to just bounding it. The tradeoff is still being explored in the community.

We take a substantially different approach with seL4; its model for managing kernel memory is seL4’s main contribution to OS design. Motivated by the desire to reason about resource usage and isolation, we subject all kernel memory to authority conveyed by capabilities. The only exception is the fixed amount of memory used by the kernel to boot up, including its strictly bounded stack. Specifically, we completely remove the kernel heap and provide userland with a mechanism to identify authorised kernel memory whenever the kernel allocates data structures. A side effect is that this reduces the size and complexity of the kernel, a major bonus to verification.

The key is making all kernel objects explicit and subject to capability-based access control. This approach is inspired by hardware-based capability systems, specifically CAP [Needham and Walker 1977], where hardware-interpreted capabilities directly refer to memory. HiStar [Zeldovich et al. 2011] also makes all kernel objects explicit, though it takes a caching approach to memory management.

Of course, user-visible kernel objects do not mean that someone with authority over a kernel object can directly read or write it. The capability provides the right to invoke (a subset of) object-specific methods, which includes destruction of the object. (Objects, once created, never change their size.) Crucially, the kernel object types include unused memory, called Untyped in seL4, which can be used to create other objects.

Specifically, the only operation possible on Untyped is to retype part of it into some object type. The relationship of the new object to the original Untyped is recorded in a capability derivation tree, which also records other kinds of capability derivation, such as the creation of capability copies with reduced privileges. Once some Untyped has been retyped, the only operation possible on the (corresponding part of the) original Untyped is to revoke the derived object (see below).

Retyping is the only way to create objects in seL4. Hence, by limiting access to Untyped memory, a system can control resource allocation. Retyping can also produce smaller Untyped objects, which can then be independently managed—this is key to delegating resource management. The derivation from Untyped also ensures the kernel integrity property that no two typed objects overlap. ...
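The retype mechanism above can be sketched as a toy model (the real seL4 API is `seL4_Untyped_Retype` in libsel4; this only models the bookkeeping): a watermark allocator inside each Untyped region guarantees that derived objects never overlap, and every derivation is recorded in the tree.

```python
class Untyped:
    """Toy model of an seL4 Untyped region: a watermark allocator
    whose derivations are recorded as children in the capability
    derivation tree."""
    def __init__(self, base, size):
        self.base, self.size = base, size
        self.watermark = base          # next free byte in the region
        self.children = []             # capability derivation tree edges

    def retype(self, obj_type, obj_size):
        # Objects are carved off at the watermark, so no two objects
        # derived from this Untyped can overlap (integrity invariant).
        if self.watermark + obj_size > self.base + self.size:
            raise MemoryError("Untyped exhausted")
        obj = (obj_type, self.watermark, obj_size)
        self.watermark += obj_size
        self.children.append(obj)
        return obj

ut = Untyped(base=0x10000, size=0x1000)
tcb = ut.retype("TCB", 0x200)
frame = ut.retype("Frame", 0x400)
# Derived objects occupy disjoint ranges:
disjoint = tcb[1] + tcb[2] <= frame[1]
```

Delegating resource management then amounts to retyping part of an Untyped into a smaller Untyped and handing the capability to it to a subsystem.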

There is one long-running operation, capability revocation, which requires preemption points. These ensure that the kernel is in a consistent state, that it has made progress, and to check for pending interrupts. If there are pending interrupts, then the kernel returns to usermode to restart the system call, which ensures that the interrupt gets handled first. The restarted system call continues the tear-down operation where the previous attempt was discontinued.


Objects are revoked by invoking the revoke() method on an Untyped object further up the tree; this will remove all capabilities to all objects derived from that Untyped. When the last capability to an object is removed, the object itself is deleted. This removes any in-kernel dependencies it may have with other objects, thus making it available for reuse. Removal of the last capability is easy to detect, as it cleans up the last leaf node in the capability tree referring to a particular memory location. " [29]
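The revoke semantics can be sketched as a toy derivation tree (again not seL4 code): revoking a node tears down everything derived below it, depth first, after which the memory behind the subtree is safe to reuse; the node itself survives.

```python
class CapNode:
    """Toy node in a capability derivation tree."""
    def __init__(self, name):
        self.name, self.children, self.live = name, [], True

    def derive(self, name):
        child = CapNode(name)
        self.children.append(child)
        return child

    def revoke(self):
        # Depth-first teardown of everything derived from this node;
        # the node itself remains live and can be re-derived from.
        for c in self.children:
            c.revoke()
            c.live = False
        self.children.clear()

ut = CapNode("Untyped")
tcb = ut.derive("TCB")
ep = ut.derive("Endpoint")
badged = ep.derive("Endpoint[badge=7]")   # reduced-privilege copy

ut.revoke()                               # tear down the whole subtree
```

A real kernel interleaves preemption points into this walk, as the quoted passage above describes, so the (potentially long) teardown stays restartable.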

" Table III gives the complete set of seL4 object types and their use for 32-bit ARM pro- cessors, x86 is very similar. Userland can only directly access (load/store/fetch) memory corresponding to a Frame that is mapped in its address space by inserting the Frame capability into a Page Table.

Table III. seL4 Kernel Objects for 32-bit ARM Processors

" [30]

L4 process hierarchy (dropped)

"L4 does not have a first-class process concept, it is a higher- level abstraction that consists of an address space, represented by a page table, and a number of associated threads. These consume kernel resources, and unchecked alloca- tion of TCBs and page tables could easily lead to denial of service. Original L4 dealt with that through a process hierarchy: “Task IDs” were essentially capabilities over address spaces, allowing creation and deletion. " [31]

seL4 to read for me

section 2.1 of

section 2.1 of

seL4 manual:

L4 links




" Capability-based security supposedly makes security easy to use by providing an intuitive way to manage authority without the need for an all-encompassing and complex global system policy. Kernelization of software components aids the deconstruction of complex software into low-complexity security-sensitive parts and high-complexity parts. The lat- ter no longer need to be considered as part of the trusted computing base. Virtualization can bridge the gap between applications that expect current-generation OSes and a new operating-system design. The management of budgets within hierarchical organizations shows how limited resources can be utilized and still be properly accounted for. None of those techniques is new by any means. However, they have never been used as a composition of a general-purpose operating system. This is where Genode comes into the picture. Application-specific trusted computing base A Genode system is structured as a tree of components where each component (except for the root of the tree) is owned by its parent. The notion of ownership means both responsibility and control. Being responsible for its children, the parent has to explicitly provide the resources needed by its children out of its own resources. It is also responsible to acquaint children with one another and the outside world. In return, the parent retains ultimate control over each of its children. As the owner of a child, it has ultimate power over the child’s envi- ronment, the child’s view of the system, and the lifetime of the child. Each child can, in turn, have children, which yields a recursive system structure. Figure 1 illustrates the idea. At the root of the tree, there is a low-complexity microkernel that is always part of the TCB. The kernel is solely responsible to provide protection domains, threads of execution, and the controlled communication between protection domains. 
All other system functions such as device drivers, network stacks, file systems, runtime environ- ments, virtual machines, security functions, and resource multiplexers are realized as components within the tree "
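The ownership model above can be sketched as a toy tree (hypothetical names, not Genode's actual API): a parent funds each child out of its own RAM quota, and destroying a child reclaims everything its subtree held.

```python
class Component:
    """Toy model of Genode's component tree: a parent explicitly
    donates part of its own quota to each child it creates."""
    def __init__(self, name, quota):
        self.name, self.quota, self.children = name, quota, []

    def create_child(self, name, quota):
        if quota > self.quota:
            raise ValueError("parent cannot donate more than it owns")
        self.quota -= quota            # donation comes out of the parent
        child = Component(name, quota)
        self.children.append(child)
        return child

    def destroy_child(self, child):
        # The parent retains ultimate control: destroying a child
        # reclaims its quota, including everything its subtree holds.
        self.quota += child.total_quota()
        self.children.remove(child)

    def total_quota(self):
        return self.quota + sum(c.total_quota() for c in self.children)

init = Component("init", quota=1024)
gui = init.create_child("gui", 256)
app = gui.create_child("app", 64)      # gui funds app out of its 256
init.destroy_child(gui)                # subtree quota returns to init
```

Because quota only ever flows down the tree and back, the whole system's resources remain accounted for at the root.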

" An RPC object provides a remote-procedure call (RPC) interface. Similar to a regular object, an RPC object can be constructed and accessed from within the same program. But in contrast to a regular object, it can also be called from the outside of the compo- nent. What a pointer is to a regular object, a capability is to an RPC object. It is a token that unambiguously refers to an RPC object. In the following, we represent an RPC object as follows ... However, there are two important differences between a capability and a pointer. First, in contrast to a pointer that can be created out of thin air (e. g., by casting an arbitrary number to a pointer), a capability cannot be created without an RPC object. At the creation time of an RPC object, Genode creates a so-called object identity that represents the RPC object in the kernel ... For each protection domain, the kernel maintains a so-called capability space, which is a name space that is local to the protection domain. At the creation time of an RPC object, the kernel creates a corresponding object identity and lets a slot in the protection domain’s capability space refer to the RPC object’s identity. From the component’s point of view, the RPC object A has the name 3. When interacting with the kernel, the component can use this number to refer to the RPC object A. ... Whenever the kernel delegates a capability from one to another protection domain, it inserts a reference to the RPC object’s identity into a free slot of the target’s capability space. Within protection domain 2 shown in Figure 3, the RPC object can be referred to by the number 5. Within protection domain 3, the same RPC object is known as 2. Note that the capability delegation does not hand over the ownership of the object identity to the target protection domain. The ownership is always retained by the protection domain that created the RPC object. "



Zircon is the core platform that powers the Fuchsia OS. Zircon is composed of a microkernel (source in kernel/...) as well as a small set of userspace services, drivers, and libraries (source in system/...) necessary for the system to boot, talk to hardware, load userspace processes and run them, etc. Fuchsia builds a much larger OS on top of this foundation.

The canonical Zircon Git repository is located at:

A read-only mirror of the code is present at:

The Zircon Kernel provides syscalls to manage processes, threads, virtual memory, inter-process communication, waiting on object state changes, and locking (via futexes).

Currently there are some temporary syscalls that have been used for early bringup work, which will be going away in the future as the long term syscall API/ABI surface is finalized. The expectation is that there will be about 100 syscalls.

Zircon syscalls are generally non-blocking; the notable exceptions are wait_one, wait_many, port_wait, and thread sleep.


akavel 1 day ago [-]

Personally, in my heart, I'm kinda keeping fingers crossed for a future where if one needs to/has to, one uses Rust (mostly where one would use C/C++ today, including for "squeezing the last drop of performance", and especially high risk everpresent libraries like libpng; plus as an extra, probably for highly ambitious parallelism like in Servo); otherwise, one just goes with Go (for the amazing general productivity and readability).

Though then there's also Luna-lang, which I'm hoping even stronger to become long-term a new breakthrough paradigm/tech in programming & scripting...


" When reading this I think of my deep knowledge of TCP (especially the pain points) and yet it is "The Internet" for the past while. Compared to simpler protocols like Delta-T and later/better protocols like XTP, TCP still "won" for a variety of reasons (adoption, good enough, numbers of implementation, hardware implementation, etc) "


" The IO subsystem is architected very differently in Windows than it is in Linux, with different goals in mind. Some of the major differences we have to deal with are:

    Linux has a top-level directory entry cache that means that certain queries (most notably stat calls) can be serviced without calling into the file system at all once an item is in the cache. Windows has no such cache, and leaves much more up to the file systems. A Win32 path like C:\dir\file gets translated to an NT path like \??\C:\dir\file, where \??\C: is a symlink in Object Manager to a device object like \Device\HarddiskVolume4. Once such a device object is encountered, the entire remainder of the path is just passed to the file system, which is very different to the centralized path parsing that VFS does in Linux.
    Windows's IO stack is extensible, allowing filter drivers to attach to volumes and intercept IO requests before the file system sees them. This is used for numerous things, including virus scanning, compression, encryption, file virtualization, things like OneDrive's files on demand feature, gathering pre-fetching data to speed up app startup, and much more. Even a clean install of Windows will have a number of filters present, particularly on the system volume (so if you have a D: drive or partition, I recommend using that instead, since it likely has fewer filters attached). Filters are involved in many IO operations, most notably creating/opening files.
    The NT file system API is designed around handles, not paths. Almost any operation requires opening the file first, which can be expensive. Even things that on the Win32 level seem to be a single call (e.g. DeleteFile) actually open and close the file under the hood. One of our biggest performance optimizations for DrvFs which we did several releases ago was the introduction of a new API that allows us to query file information without having to open it first.

Whether we like it or not (and we don't), file operations in Windows are more expensive than in Linux, even more so for those operations that only touch file metadata (such as stat). "
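The architectural difference above can be made concrete with a toy model (hypothetical, counting "filesystem calls" rather than real syscalls): a Linux-style directory-entry cache answers repeated stats without entering the filesystem, while a handle-based design pays an open/query/close per call.

```python
# Toy model contrasting the two designs by counting filesystem entries.
fs_calls = {"linux": 0, "windows": 0}
dentry_cache = {}                      # path -> cached metadata

def linux_stat(path):
    # A hit in the top-level directory-entry cache never reaches the FS.
    if path not in dentry_cache:
        fs_calls["linux"] += 1         # one lookup populates the cache
        dentry_cache[path] = {"size": 42}
    return dentry_cache[path]

def windows_stat(path):
    # Handle-based design: every query opens, queries, and closes,
    # each of which traverses the (possibly filtered) IO stack.
    fs_calls["windows"] += 3
    return {"size": 42}

for _ in range(100):
    linux_stat("/src/main.c")
    windows_stat(r"C:\src\main.c")
```

This is why metadata-heavy workloads such as `git status` or incremental builds feel so much slower on the Windows side, even though bulk data transfer is competitive.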

herf 6 hours ago [-]

I spent many years optimizing "stat-like" APIs for Picasa - Windows just feels very different than Linux once you're benchmarking.

It turns out Windows/SMB is very good at "give me all metadata over the wire for a directory" and not so fast at single file stat performance. On a high-latency network (e.g. Wi-Fi) the Windows approach is faster, but on a local disk (e.g., compiling code), Linux stat is faster.
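The trade-off herf describes falls out of simple latency arithmetic. A hypothetical model (assumed numbers: 5 ms Wi-Fi round trip, 50 µs local-disk round trip, 10 µs per-file processing) shows one bulk "directory with metadata" call dominating per-file stat on a high-latency link, while the gap collapses on local disk.

```python
def total_time(n_files, rtt, per_file_cost, bulk=False):
    """Crude latency model: per-file stat pays one round trip per
    file, while a bulk directory-metadata call pays just one."""
    round_trips = 1 if bulk else n_files
    return round_trips * rtt + n_files * per_file_cost

N = 1_000
wifi_rtt, disk_rtt = 5e-3, 50e-6       # assumed: 5 ms Wi-Fi, 50 us disk
cost = 10e-6                           # assumed per-file processing cost

# High-latency link: one bulk call wins by orders of magnitude.
wifi_bulk = total_time(N, wifi_rtt, cost, bulk=True)    # ~0.015 s
wifi_stat = total_time(N, wifi_rtt, cost, bulk=False)   # ~5 s

# Local disk: per-file stat is cheap enough that it stays practical.
disk_stat = total_time(N, disk_rtt, cost, bulk=False)   # ~0.06 s
```

Under these assumptions the SMB-style bulk query is ~300x faster over Wi-Fi, while a local per-file stat loop is still well under a tenth of a second, matching the observation above.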