notes-computer-systems-greenField


OS

"To be a viable computer system, one must honor a huge list of large, and often changing, standards: TCP/IP, HTTP, HTML, XML, CORBA, Unicode, POSIX, NFS, SMB, MIME, POP, IMAP, X, ... A huge amount of work, but if you don’t honor the standards you’re marginalized. Estimate that 90-95% of the work in Plan 9 was directly or indirectly to honor externally imposed standards." -- http://herpolhode.com/rob/utah2000.pdf (Rob Pike talk)


MS singularity:


Plan 9 notes:

syscalls from http://man.cat-v.org/plan_9/2/intro :

bind open read dup fork seek stat remove fd2path getwd pipe exec chdir segattach exits wait sleep notify lock errstr

9P protocol messages from http://man.cat-v.org/plan_9/5/intro :

version auth error flush attach walk open create read write clunk remove stat wstat

version auth error flush clunk are protocol metastuff (flush aborts a transaction, clunk deallocates a fid, the 9P analogue of a file descriptor), leaving: attach walk open create read write remove stat wstat
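
For reference, each 9P message travels as a little-endian, length-prefixed packet: size[4] type[1] tag[2] followed by type-specific fields, with strings encoded as a 2-byte count plus the bytes. A hedged C sketch of serializing a Tversion request (the constants Tversion=100 and NOTAG=0xFFFF are from the 9P2000 spec; buffer management is left out):

    #include <stdint.h>
    #include <string.h>

    /* Hedged sketch: serialize a 9P2000 Tversion message.
       Wire format: size[4] type[1] tag[2] msize[4] version[s], little-endian;
       a string s is a 2-byte count followed by that many bytes. */
    static void put32(uint8_t *p, uint32_t v) { p[0] = v; p[1] = v >> 8; p[2] = v >> 16; p[3] = v >> 24; }
    static void put16(uint8_t *p, uint16_t v) { p[0] = v; p[1] = v >> 8; }

    size_t mk_tversion(uint8_t *buf, uint32_t msize, const char *version)
    {
        uint16_t vlen = (uint16_t)strlen(version);
        uint32_t size = 4 + 1 + 2 + 4 + 2 + vlen;  /* total length, counting the size field itself */

        put32(buf, size);
        buf[4] = 100;                              /* type: Tversion */
        put16(buf + 5, 0xFFFF);                    /* tag: NOTAG */
        put32(buf + 7, msize);                     /* largest message we are willing to handle */
        put16(buf + 11, vlen);                     /* version string */
        memcpy(buf + 13, version, vlen);
        return size;
    }

    /* usage: n = mk_tversion(buf, 8192, "9P2000"); write(fd, buf, n); expect an Rversion back */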

ddevault on Jan 28, 2018 [-]

The kernel is simple and sanely designed and interfacing with it is done through the filesystem in, again, a very simple and sane way. Your displays are configured by a simple <10 line shell script at boot time that writes plaintext to a device file, rather than a gargantuan graphics stack that's controlled with binary device files and ioctls. Filesystem namespaces are lovely, and make the filesystem cleanly organized and customized to the logged in user's needs and desires. I have a little script that clears my current terminal window: `echo -n "" > /dev/text`. The libc is small and focused (non-POSIX), and there are only a handful of syscalls. The process model is really sane and straightforward as well. Playing an mp3 file is `mp3dec < file.mp3 > /dev/audio`.

To open a TCP connection, you use the dial(3) function, which basically does the following: write "tcp!name!port" to /net/cs and read out "1.2.3.4!80" (/net/cs resolves the name), then you write "connect 1.2.3.4" to /net/tcp/clone and read out a connection ID, and open /net/tcp/:id/data which is now a full-duplex TCP stream.

There's this emphasis on simple, sane ways of fulfilling tasks on plan9 that permeates the whole system. It's beautiful.
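
A hedged sketch of that /net sequence in Plan 9 C, doing by hand what dial(2) does for you; the host, port, and resolved address are made up, and error handling is omitted:

    #include <u.h>
    #include <libc.h>

    /* Hand-rolled version of what dial(2) does, following the steps above.
       Illustrative only: error checks omitted, and the address is hardcoded
       instead of being parsed out of the /net/cs answer. */
    void
    main(void)
    {
        char buf[128], path[64], num[16];
        int csfd, ctlfd, datafd, n;

        /* 1. ask the connection server to translate the dial string */
        csfd = open("/net/cs", ORDWR);
        fprint(csfd, "tcp!example.com!80");
        seek(csfd, 0, 0);
        n = read(csfd, buf, sizeof buf - 1);    /* e.g. "/net/tcp/clone 93.184.216.34!80" */
        buf[n] = 0;

        /* 2. open the clone file; reading it yields a fresh conversation number */
        ctlfd = open("/net/tcp/clone", ORDWR);
        n = read(ctlfd, num, sizeof num - 1);
        num[n] = 0;
        fprint(ctlfd, "connect 93.184.216.34!80");

        /* 3. the conversation's data file is now a full-duplex TCP stream */
        snprint(path, sizeof path, "/net/tcp/%s/data", num);
        datafd = open(path, ORDWR);
        write(datafd, "GET / HTTP/1.0\r\n\r\n", 18);
        exits(nil);
    }

A real program just calls dial("tcp!example.com!80", 0, 0, 0) and gets the data fd back.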

https://thedorkweb.substack.com/p/a-week-with-plan-9


https://lwn.net/SubscriberLink/718267/206c8a5fbf0ee2ea/ https://news.ycombinator.com/item?id=14002386

" Fuchsia: a new operating system

Nur Hussein

Fuchsia is a new operating system being built more or less from scratch at Google. ... At the heart of Fuchsia is the Magenta microkernel... LK, the kernel that Magenta builds upon, was created by Fuchsia developer Travis Geiselbrecht before he joined Google. LK's goal is to be a small kernel that runs on resource-constrained tiny embedded systems (in the same vein as FreeRTOS or ThreadX). Magenta, on the other hand, targets more sophisticated hardware (a 64-bit CPU with a memory-management unit is required to run it), and thus expands upon LK's limited features. Magenta uses LK's "inner constructs", which is comprised of threads, mutexes, timers, events (signals), wait queues, semaphores, and a virtual memory manager (VMM). For Magenta, LK's VMM has been substantially improved upon.

One of the key design features of Magenta is the use of capabilities....Capabilities are implemented in Magenta by the use of constructs called handles....Almost all system calls require that a handle be passed to them. Handles have rights associated with them... The rights that can be granted to a handle are for reading or writing to the associated kernel object or, in the case of a virtual memory object, whether or not it can be mapped as executable....Since memory is treated as a resource that is accessed via kernel objects, processes gain use of memory via handles. Creating a process in Fuchsia means a creator process (such as a shell) must do the work of creating virtual memory objects manually for the child process. This is different from traditional Unix-like kernels such as Linux, where the kernel does the bulk of the virtual memory setup for processes automatically. Magenta's virtual memory objects can map memory in any number of ways, and a lot of flexibility is given to processes to do so. One can even imagine a scenario where memory isn't mapped at all, but can still be read or written to via its handle like a file descriptor. While this setup allows for all kinds of creative uses, it also means that a lot of the scaffolding work for processes to run must be done by the user-space environment.

Since Magenta was designed as a microkernel, most of the operating system's major functional components also run as user-space processes. This include the drivers, network stack, and filesystems. The network stack was originally bootstrapped from lwIP, but eventually it was replaced by a custom network stack written by the Fuchsia team. The network stack is an application that sits between the user-space network drivers and the application that requests network services. A BSD socket API is provided by the network stack.

The default Fuchsia filesystem, called minfs, was also built from scratch... since the filesystems run as user-space servers, accessing them is done via a protocol to those servers....The user-space C libraries make the protocol transparent to user programs, which will just make calls to open, close, read, and write files. ... Full POSIX compatibility is not a goal for the Fuchsia project; enough POSIX compatibility is provided via the C library, which is a port of the musl project to Fuchsia..."
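
A hedged C sketch of the handle/VMO idea from the article, using the newer zx_* names (the Magenta-era API spelled them mx_*); signatures are approximate and may differ between Zircon versions, but the point is that memory is created, written, and read through a handle, like a file descriptor, without ever being mapped:

    #include <string.h>
    #include <zircon/syscalls.h>

    /* Illustrative only; error handling trimmed and signatures approximate. */
    int vmo_demo(void) {
        zx_handle_t vmo;
        const char msg[] = "hello";
        char back[sizeof msg];

        if (zx_vmo_create(4096, 0, &vmo) != ZX_OK)   /* a 4 KiB virtual memory object */
            return -1;
        zx_vmo_write(vmo, msg, 0, sizeof msg);       /* file-descriptor-like access, no mapping */
        zx_vmo_read(vmo, back, 0, sizeof back);
        zx_handle_close(vmo);                        /* dropping the handle drops the capability */
        return memcmp(msg, back, sizeof msg);        /* 0 on success */
    }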

https://fuchsia.googlesource.com/docs/+/master/the-book/

Khaine 13 hours ago [-]

One thing I don't see addressed in the README is why? Why do we need Fuchsia? What problem are we trying to solve? Why should I use/develop for it instead of Windows/Linux/macOS?

Or is this just a research operating system designed to test new ideas out?


akavel 12 hours ago [-]

The main keywords are "capability based" and "microkernel". Those ideas bring numerous advantages over monolithic kernels (including Linux, Windows, macOS), especially humongous boost to protection against vulnerabilities, also better reliability and modularity. They are quite well researched already AFAIU, and apparently the time has come for them to start breaking through to "mainstream" (besides Fuchsia, see e.g. https://genode.org, https://redox-os.org)

Other than that, for Google this would obviously bring total control over the codebase, allowing them to do whatever they want, and super quickly, not needing to convince Linus or anybody else.


naasking 21 hours ago [-]

Some problems I see from skimming the docs:

> Calls which have no limitations, of which there are only a very few, for example zx_clock_get() and zx_nanosleep() may be called by any thread.

Having the clock be an ambient authority leaves the system open to easy timing attacks via implicit covert channels. I'm glad these kinds of timing attacks have gotten more attention with Spectre and Meltdown. Capability security folks have been pointing these out for decades.

> Calls which create new Objects but do not take a Handle, such as zx_event_create() and zx_channel_create(). Access to these (and limitations upon them) is controlled by the Job in which the calling Process is contained.

I'm hesitant to endorse any system calls with ambient authority, even if it's scoped by context like these. It's far too easy to introduce subtle vulnerabilities. For instance, these calls seem to permit a Confused Deputy attack as long as two processes are running in the same Job.

Other notes on the kernel:

Looks like they'll also support private namespacing ala Plan 9, which is great. I hope we can get a robust OS to replace existing antiquated systems with Google's resources. This looks like a good start.



overview from the Lisp Machine manual http://lispm.de/genera-concepts

---

Networking

Associated VMs


L4

" In this spirit, the L4 microkernel provides few basic mechanisms: address spaces (abstracting page tables and providing memory protection), threads and scheduling (abstracting execution and providing temporal protection), and inter-process communication (for controlled communication across isolation boundaries).

An operating system based on a microkernel like L4 provides services as servers in user space that monolithic kernels like Linux or older generation microkernels include internally. For example, in order to implement a secure Unix-like system, servers must provide the rights management that Mach included inside the kernel. " [1]

" Microkernels minimize the functionality that is provided by the kernel: The kernel provides a set of general mechanisms, while user-mode servers implement the actual operating system (OS) services [Brinch Hansen 1970; Levin et al. 1975]. Application code obtains a system service by communicating with servers via an interprocess com- munication (IPC) mechanism, typically message passing. Hence, IPC is on the critical path of any service invocation, and low IPC costs are essential.

By the early 1990s, IPC performance had become the Achilles heel of microkernels: The typical cost for a one-way message was around 100μs, which was too high for building performant systems. This resulted in a trend to move core services back into the kernel [Condict et al. 1994]. ... Twenty years ago, Liedtke [1993a] demonstrated with his L4 kernel that microkernel IPC could be fast, a factor 10–20 faster than contemporary microkernels. " [2]

" The asynchronous in-kernel-buffering process communication concept used in Mach turned out to be one of the main reasons for its poor performance. ... Detailed analysis of the Mach bottleneck indicated that, among other things, its working set is too large: the IPC code expresses poor spatial locality; that is, it results in too many cache misses, of which most are in-kernel.[3] This analysis gave rise to the principle that an efficient microkernel should be small enough that the majority of performance-critical code fits into the (first-level) cache (preferably a small fraction of said cache).

...

Instead of Mach's complex IPC system, (Jochen Liedtke's) L3 microkernel simply passed the message without any additional overhead.

...

After some experience using L3, Liedtke came to the conclusion that several other Mach concepts were also misplaced. By simplifying the microkernel concepts even further he developed the first L4 kernel which was primarily designed with high performance in mind. In order to wring out every bit of performance the entire kernel was written in assembly language, and its IPC was 20 times faster than Mach's. ... " https://en.wikipedia.org/wiki/L4_microkernel_family

"...all of them have a scheduler in the kernel, which implements a particular scheduling policy (usually hard-priority round robin)." [3]

L4 IPC

"We mentioned earlier the importance of IPC performance and that the design and im- plementation of L4 kernels consistently aimed at maximising it. However, the details have evolved considerably.

3.2.1. Synchronous IPC. The original L4 supported synchronous (rendezvous-style) IPC (((the sender blocks))) as the only...mechanism. Synchronous IPC avoids buffering in the kernel and the management and copying cost associated with it. In fact, in its simplest version (short messages passed in registers) it is nothing but a context switch that leaves the message registers untouched...typical L4 implementations have IPC costs that are only 10% to 20% above the hardware limit (defined as the cost of two mode switches, a switch of page tables, plus saving and restoring addressing context and user-visible processor state)...

While certainly minimal, and simple conceptually and in implementation, experience taught us significant drawbacks of this model: It forces a multithreaded design onto otherwise simple systems, with the resulting synchronisation complexities. For example, the lack of functionality similar to UNIX select() required separate threads per interrupt source, and a single-threaded server could not wait for client requests and interrupts at the same time.

Furthermore, synchronous message passing is clearly the wrong way of synchronising activities across processor cores. On a single processor, communication between threads requires that (eventually) a context switch happens, and combining the context switch with communication minimises overheads. Consequently, the classical L4 IPC model is that of a user-controlled context switch that bypasses the scheduler; some payload is delivered through nonswitched registers, and further optional payload by kernel copy. On hardware that supports true parallelism, an RPC-like server invocation sequentialises client and server, which should be avoided if they are running on separate cores... " [4]

notifications:

" We addressed this in L4-embedded by adding notifications , a simple, nonblocking signalling mechanism. We later refined this model in seL4’s notification objects

A notification contains a set of flags, the notification word, which is essentially an array of binary semaphores. A signal operation on a notification object sets a subset of the flags without blocking. The notification word can be checked by polling or by waiting (blocking) for a signal—effectively select() across the notification word.

Our present design provides another feature aimed at reducing the need for multithreaded code, unifying waiting for IPC and notifications. For example, a file server might have an IPC interface for client requests, as well as a notification used by the disk driver to signal I/O completion. The unification feature binds a notification object to a thread (the server of the above example). If a notification is signalled while the thread is waiting for a message, the notification is converted into a single-word message and delivered to the thread (with an indication that it is really a notification). Notifications are not an introduction of asynchronous IPC through the backdoor but rather a (partial) decoupling of synchronisation from communication. While strictly not minimal (in that they add no functionality that could not be emulated with other mechanisms), they are essential for exploiting concurrency of the hardware.

In summary, like most other L4 kernels, seL4 retains the model of synchronous IPC but augments it with semaphore-like notifications. OKL4 has completely abandoned synchronous IPC and replaced it with virtual IRQs (similar to notifications). NOVA has augmented synchronous IPC with counting semaphores [Steinberg and Kauer 2010], while Fiasco.OC has also augmented synchronous IPC with virtual IRQs. " [5]
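
A hedged sketch of the notification-binding pattern described above, in seL4's (non-MCS) libsel4 C API; ep_cap, ntfn_cap, and tcb_cap are placeholder capabilities assumed to have been set up by whoever constructed the system:

    #include <sel4/sel4.h>

    /* Illustrative only: the capability slots are placeholders, and the badge
       values that distinguish clients from the notification are configured elsewhere. */
    void server_loop(seL4_CPtr ep_cap, seL4_CPtr ntfn_cap, seL4_CPtr tcb_cap)
    {
        /* Bind the notification to this thread: a single blocking receive on the
           endpoint then also returns signalled notifications, delivered as a
           one-word pseudo-message with a distinguishing badge. */
        seL4_TCB_BindNotification(tcb_cap, ntfn_cap);

        for (;;) {
            seL4_Word badge;
            seL4_MessageInfo_t info = seL4_Recv(ep_cap, &badge);
            /* badge tells us whether this was a client request or, say, an
               I/O-completion signal from a driver; handle accordingly. */
            (void)info;
        }
    }

    /* A driver signals completion without blocking: */
    void io_done(seL4_CPtr ntfn_cap)
    {
        seL4_Signal(ntfn_cap);
    }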

virtual registers:

early L4 versions passed messages in CPU registers. But "Pistachio introduced the concept of virtual message registers (originally 64 and later a configuration option). The implementation mapped some of them to physical registers, and the rest was contained in a per-thread pinned part of the address space. The pinning ensures register-like access without the possibility of a page fault. Inlined access functions hide the distinction between physical and memory-backed registers from the user. seL4 and Fiasco.OC continue to use this approach. The motivation is two-fold: virtual message registers greatly improve portability across architectures. Furthermore, they reduce the performance penalty for moderately sized messages exceeding the number of physical registers " [6]

long messages (dropped):

" In original L4, “long” messages could specify multiple buffers in a single IPC invo- cation to amortise the hardware mode- and context-switch costs...required the kernel to handle nested exceptions...in practice it was rarely used: Shared buffers can avoid any explicit copying between address spaces and are generally preferred for bulk data transfer..." [7]. The need for nested exception handling (page fault exceptions, actually) in the kernel would have made verification of seL4 much harder.

IPC Destinations (dropped):

" Original L4 had threads as the targets of IPC operations ... Influenced by EROS [Shapiro et al. 1999], seL4 and Fiasco.OC [Lackorzynski and Warg 2009]) adopted IPC endpoints as IPC destinations. seL4 endpoints are essentially ports: The root of the queue of pending senders or receivers is a now a separate kernel object, instead of being part of the recipient’s thread control block (TCB). Unlike Mach ports [Accetta et al. 1986], IPC endpoints do not provide any buffering. ... In order to help servers identify clients without requiring per-client endpoints, seL4 provides badged capabilities , similar to the distinguished capabilities of KeyKOS? [Bromberger et al. 1992]. Capabilities with different badges but derived from the same original capability refer to the same (endpoint) object but on invokation deliver to the receiver the badge as an identification of the sender. "

IPC timeouts (dropped):

" A blocking IPC mechanism creates opportunities for denial-of- service (DOS) attacks. For example, a malicious (or buggy) client could send a request to a server without ever attempting to collect the reply; owing to the rendezvous-style IPC, the sender would block indefinitely unless it implements a watchdog to abort and restart...

To protect against such attacks, IPC operations in the original L4 had timeouts. Specifically, an IPC syscall specified four timeouts: one to limit blocking until start of the send phase, one to limit blocking in the receive phase, and two more to limit blocking on page faults during the send and receive phases (of long IPC).

Timeout values were encoded in a floating-point format that supported the values of zero, infinity, and finite values ranging from one millisecond to weeks. They added complexity for managing wakeup lists.

Practically, however, timeouts were of little use as a DOS defence. There is no theory, or even good heuristics, for choosing timeout values in a nontrivial system, and in practice, only the values zero and infinity were used: A client sends and receives with infinite timeouts, while a server waits for a request with an infinite but replies with a zero timeout. (Footnote 7: The client uses an RPC-style call operation, consisting of a send followed by an atomic switch to a receive phase, guaranteeing that the client is ready to receive the server’s reply.)

Traditional watchdog timers represent a better approach to detecting unresponsive IPC interactions (e.g., resulting from deadlocks).

Having abandoned long IPC, in L4-embedded we replaced timeouts by a single flag supporting a choice of polling (zero timeout) or blocking (infinite timeout). Only two flags are needed, one for the send and one for the receive phase. seL4 follows this model. A fully asynchronous model, such as that of OKL4, is incompatible with timeouts and has no DOS issues that would require them.

Timeouts could also be used for timed sleeps by waiting on a message from a nonexisting thread, a feature useful in real-time systems. Dresden experimented with extensions, including absolute timeouts, which expire at a particular wall clock time rather than relative to the commencement of the system call. Our approach is to give userland access to a (physical or virtual) timer "

L4 implementations

"He unveiled the 12-Kbyte fast L4 microkernel after 1996 while working for IBM in New York City. ... The L4 developer community is very active and a modern implementation of L4 in C++ is the L4Ka::Pistachio microkernel " [8] (2005)

"Karlsruhe’s experience with Version X and Hazelnut resulted in a major ABI revision, V4, aimed at improving kernel and application portability and multiprocessor support and addressing various other shortcomings. After Liedtke’s tragic death in 2001, his students implemented the design in a new open-source kernel, L4Ka::Pistachio (“Pistachio” for short)." [9]

"At NICTA we then retargeted Pistachio for use in resource-constrained embedded systems, resulting in a fork called NICTA::Pistachio-embedded (“L4-embedded”). It saw massive-scale commercial deployment when Qualcomm adopted it as a protected-mode real-time OS for the firmware of their wireless modem processors. It is now running on the security processor of all recent Apple iOS devices" [10]

" The influence of KeyKOS? [Hardy 1985] and EROS [Shapiro et al. 1999] and an increased focus on security resulted in the adoption of capabilities [Dennis and Van Horn 1966] for access control, first with the 2.1 release of OKL4 (2008) and soon followed by Fiasco; Fiasco was renamed Fiasco.OC in reference to its use of object capabilities. Aiming for formal verification, which seemed infeasible for a code base not designed for the purpose, we instead opted for a from-scratch implementation for our capability-based seL4 kernel. " [11]

"seL4’s SLOC count is somewhat bloated as a consequence of the C code being mostly a “blind” manual translation from Haskell [Klein et al. 2009], together with generated bit-field accessor functions, resulting in hundreds of small functions. The kernel compiles into about 9 k ARM instructions" [12]

" Later the UNSW group, at their new home at NICTA, forked L4Ka::Pistachio into a new L4 version called NICTA::L4-embedded. As the name implies, this was aimed at use in commercial embedded systems, and consequently the implementation trade-offs favored small memory footprints and aimed to reduce complexity. The API was modified to keep almost all system calls short enough that they do not need preemption points to ensure high real-time responsiveness.[10] " [13]

" On 29 July 2014, NICTA and General Dynamics C4 Systems announced that seL4, with end to end proofs, was now released under open source licenses.[20] The kernel source and proofs are under GPLv2, and most libraries and tools are under the 2-clause BSD license. " [14]

" F9 microkernel, a BSD-licensed L4 implementation, is built from scratch for deeply embedded devices with performance on ARM Cortex-M3/M4 processors, power consumption, and memory protection in mind. " [15]

F9 has about 3k LOC [16]

"the most general principles behind L4, minimality, including running device drivers at user level, generality, and a strong focus on performance, still remain relevant and foremost in the minds of developers. Specifically we find that the key microkernel per- formance metric, IPC latency, has remained essentially unchanged, in terms of clock cycles, as far as comparisons across vastly different ISAs and micro architectures have any validity. This is in stark contrast to the trend identified by Ousterhout [1990] just a few years before L4 was created. Furthermore, and maybe most surprisingly, the code size has essentially remained constant, a rather unusual development in software sys- tems" [17]

[18] section 2.2 implies that the most relevant current implementations are seL4 and Fiasco.OC, and NOVA (and a proprietary one, but that's not useful for me).

so it sounds like maybe the ones for me to look at are: seL4, L4Ka::Pistachio, F9, and Fiasco.OC

L4 seL4

"We have throughout commented on the influence verification had on seL4’s design and implementation. Specifically, verification prohibited or discouraged — a weak and unprincipled resource management model (Section 3.4.3) — configurable variants, such as pluggable schedulers (Section 3.4.4) — nested exceptions (Section 4.1) — nonstandard calling conventions (Section 4.6) — assembly code (Section 4.7) — concurrency and locking (Section 5.2). ... verification imposed some coding discipline on the implementors. Al- most all of this is in line with good software-engineering practice and thus is not really restrictive. The main exception is the prohibition on passing references to automatic variables to functions [Klein et al. 2009]. This is not an inherent requirement of ver- ification, but a tradeoff between verification and implementation effort. It is the sole drawback of verification we observed to date and could be removed by more investment into the verification infrastructure. ... Several design deci- sions, such as the simplified message structure, user-level control of kernel memory, and the approach to multicores are strongly influenced by verification. It also impacted a number of implementation approaches, such as the use of an event-oriented kernel, adoption of standard calling conventions, and the choice of C as the implementation language ... The first L4 kernel written completely in a high-level language was Fiasco. The developers chose C++ rather than C...The Karlsruhe team also chose C++ for Pistachio, mostly to support portability...for seL4, the requirements of verification forced the choice of C. " [19]

" In seL4, the architecture-agnostic code (between x86 and ARM) only accounts for about 50%. About half the code deals with virtual memory management, which is necessarily architecture-specific. The lower fraction of portable code is a result of seL4’s overall smaller size, with most (architecture-agnostic) resource-management code moved to userland. There is little architecture-specific optimisation except for the IPC fastpath. Steinberg [2013] similarly estimates a 50% rewri " [20]

"the event-based kernel no longer requires saving and restoring the C calling convention on a stack switch ... It is therefore essential to avoid long-running kernel operations and to use preemption points where this is not pos- sible, specifically the practically unbounded object deletion. We put significant effort into placement of preemption points, as well as on data structures and algorithms that minimises the need for them [Blackham et al. 2012]. We note that a continuation-based ... event kernel, such as seL4, provides natural support for preemption points by making them continuation points. " [21]

L4 lazy scheduling

" In the rendezvous model of IPC, a thread’s state frequently alternates between runnable and blocked. This implies frequent queue manipulations, moving a thread into and out of the ready queue, often many times within a time slice. Liedtke’s lazy scheduling trick minimises these queue manipulations: When a thread blocks on an IPC operation, the kernel updates its state in the TCB but leaves the thread in the ready queue, with the expectation it will unblock soon. When the scheduler is invoked upon a time-slice preemption, it traverses the ready queue un- til it finds a thread that is really runnable and removes the ones that are not. The approach was complemented by lazy updates of wakeup queues. Lazy scheduling moves work from the high-frequency IPC operation to the less fre- quently invoked scheduler. We observed the drawback when analysing seL4’s worst- case execution time (WCET) for enabling its use in hard real-time systems [Blackham et al. 2012]: The execution time of the scheduler is only bounded by the number of threads in the system.

To address the issue, we adopted an alternative optimisation, referred to as Benno scheduling , which does not suffer from pathological timing behaviour: Instead of leaving blocked threads in the ready queue, we defer entering unblocked threads into the ready queue until preemption time. This changes the main scheduler invariant from “all runnable threads are in the ready queue” to “the set of runnable threads consists of the currently executing thread plus the content of the ready queue.” Benno scheduling retains the desirable property that the ready queue usually does not get modified when threads block or unblock during IPC. At preemption time, the kernel inserts the (still runnable but no longer executing) preempted thread into the ready queue. This constant-time operation is the only fix-up needed. In addition, the removal of timeouts means that there are no more wakeup queues to manipulate. Endpoint wait queues must be strictly maintained, but in the common case of a server responding to client requests received via a single endpoint, they are hot in the cache, so the cost of those queue manipulations is low. This approach has similar average-case performance as lazy scheduling, while also having a bounded and small WCET. " [22]
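
A conceptual C sketch of the Benno-scheduling invariant; the structures and function names are invented for illustration and are not from any L4 code base:

    /* Conceptual sketch only; invented structures, not taken from any L4 kernel. */

    struct tcb { struct tcb *next; int runnable; };

    struct tcb *current;      /* the thread now executing (NOT in the ready queue) */
    struct tcb *ready_head;   /* Benno invariant: runnable = current + ready queue */

    /* Common case, IPC rendezvous: sender blocks, receiver runs.
       Neither thread touches the ready queue. */
    void ipc_switch_to(struct tcb *receiver)
    {
        current->runnable = 0;    /* blocks; it was never queued, so nothing to remove */
        receiver->runnable = 1;
        current = receiver;       /* direct context switch, scheduler bypassed */
    }

    /* Only at preemption time is the queue touched: the preempted thread is
       still runnable but no longer executing, so enqueue it, O(1). */
    void preempt(struct tcb *next)
    {
        current->next = ready_head;
        ready_head = current;
        current = next;
    }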

L4 process vs. event-based kernel

"The original L4 had a separate kernel stack for each thread...The many kernel stacks dominate the per-thread mem- ory overhead, and they also increase the kernel’s cache footprint...The kernel’s memory use became a significant issue when L4 was gaining traction in the embedded space, so the design needed revisiting...required handling nested exceptions, that is, page faults triggered while in the kernel. It thus adds significant kernel complexity and massive challenges to formal verification...

an event-based (single-stack) kernel with continuations...showed a reduction in kernel memory consumption and improvements in IPC performance...the event kernel’s per-thread memory use was a quarter of that of the process kernel, despite the event kernel requiring more than twice the TCB size of the process kernel for storing the continuations....Haeberlen 2003. Warton 2005 " [23]

L4 user-level device drivers

L4 has user-level device drivers, except for "A small number of drivers are still best kept in the kernel. In a modern L4 kernel this typically means a timer driver, used for preempting user processes at the end of their time slice, and a driver for the interrupt controller, which is required to safely distribute interrupts to user-level processes." [24]

" The user-level driver model is tightly coupled with modelling interrupts as IPC messages, which the kernel sends to the driver. Details of the model (IPC from a vir- tual thread vs upcall), as well as the association and acknowledgment protocol, have changed over the years (changed back and back again), but the general approach still applies. The most notable change was moving from IPC to notifications as the mechanism for interrupt delivery. The main driver here was implementation simplification, as delivery as messages required the emulation of virtual in-kernel threads as the sources of interrupt IPC, while signalling notifications is a natural match to what the hardware does. " [25]

L4 lessons

Concluding Table IV. Summary of What Did and Did Not Last the Distance from L4 Microkernels: The Lessons from 20 Years of Research and Deployment:

kept principles:

new ideas:

stuff that was removed (excepting one item with agrees=no):

implementation stuff that was removed (excepting one item with agrees=no):

elsewhere in the paper they also mention uncontroversial design decisions and implementation tricks, "such as the send-receive combinations in a single system call: the client-style call for an RPC-like invocation and the server-style reply-and-wait"
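
A hedged sketch of that call / reply-and-wait pairing in seL4's (non-MCS) libsel4 API; ep is an assumed endpoint capability and the one-word message layout is arbitrary:

    #include <sel4/sel4.h>

    /* Client side: seL4_Call is a send plus an atomic switch to the receive
       phase, so the client is guaranteed to be ready for the server's reply. */
    seL4_Word rpc_add_one(seL4_CPtr ep, seL4_Word x)
    {
        seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1);  /* label 0, 1 message register */
        seL4_SetMR(0, x);                /* payload travels in (virtual) message registers */
        info = seL4_Call(ep, info);
        return seL4_GetMR(0);            /* reply payload */
    }

    /* Server side: reply to the previous client and wait for the next request
       in one system call ("reply-and-wait"). */
    void server(seL4_CPtr ep)
    {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(ep, &badge);       /* first request */
        for (;;) {
            seL4_SetMR(0, seL4_GetMR(0) + 1);
            info = seL4_MessageInfo_new(0, 0, 0, 1);
            info = seL4_ReplyRecv(ep, info, &badge);           /* reply + wait for next */
        }
    }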

L4 misc

ISAs for L4

" A common thread throughout those two decades is the minimality principle , intro- duced in Section 3.1, and a strong focus on the performance of the critical IPC oper- ation: kernel implementors generally aim to stay close to the limits set by the micro architecture, as shown in Table I. Consequently, the L4 community tends to measure IPC latencies in cycles rather than microseconds, as this better relates to the hardware limits. In fact, the table provides an interesting view of the context-switch-friendliness of the hardware: compare the cycle counts for Pentium 4 and Itanium, both from highly optimised IPC implementations on contemporary architectures.

Table I. One-Way Cross-Address-Space IPC Cost of Various L4 Kernels

    Name       Year  Architecture  Processor   MHz    Cycles  μs
    Original   1993  486           DX50           50     250  5.00
    Original   1997  x86 (32-bit)  Pentium       160     121  0.75
    L4/MIPS    1997  MIPS64        R4700         100     100  1.00
    L4/Alpha   1997  Alpha         21064         433   70–80  0.17
    Hazelnut   2002  x86 (32-bit)  Pentium II    400     273  0.68
    Hazelnut   2002  x86 (32-bit)  Pentium 4   1,400   2,000  1.38
    Pistachio  2005  IA-64         Itanium 2   1,500      36  0.02
    OKL4       2007  ARM v5        XScale 255    400     151  0.64
    NOVA       2010  x86 (32-bit)  Bloomfield  2,660     288  0.11
    seL4       2013  x86 (32-bit)  Haswell     3,400     301  0.09
    seL4       2013  ARMv6         ARM11         532     188  0.35
    seL4       2013  ARMv7         Cortex A9   1,000     316  0.32

... Table I shows the enormous influence of (micro-)architecture on context-switching and thus IPC costs. In terms of cycles, these kept increasing on x86, culminating in the 2,000-cycle IPC on the Pentium 4 (NetBurst) architecture around the turn of the century. This created real challenges for microkernels: Our experience is that context switches that cost a few hundred cycles rarely ever matter, while thousands of cycles become a system performance issue, for example, with user-level drivers for high-bandwidth network interfaces. Fortunately, with the increased focus on power consumption, the architectural trend reversed: the latest-generation x86 processors, beginning with the Haswell microarchitecture, enable context-switch times at par with RISC processors. " [26]

L4 Recursive page mappings

Some but not all variants of L4 provide recursive page mappings:

" Having a valid mapping of a page in its address space, a process had the right to map the page into another address space. Instead of mapping, a process could grant one of its pages, which removed the page, and any authority over it, from the grantor. A mapping, but not a grant, could be revoked by an unmap operation. " [27]

but

" 25% to 50% of kernel memory was consumed by the mapping database even without malicious processes ... allows two colluding processes to force the kernel to consume large amounts of memory by recursively mapping the same frame to different pages in each other’s address space, a potential DOS attack especially on 64-bit hardware, which can only prevented by controlling IPC (via the dreaded clans- and-chiefs, see Section 3.2.5). ... We replaced it by a model that more closely mirrors hardware, where mappings always originate from ranges of physical memory frames. ... Multiple approaches: Some L4 kernels retain the model of recursive address-space construction, while seL4 and OKL4 originate mappings from frames " [28]

L4 Kernel memory security against DOS

" While capabilities provide a clean and elegant model for delega tion, by themselves they do not solve the problem of resource management. A single malicious thread with grant right on a mapping can still use this to create a large number of mappings, forcing the kernel to consume large amounts of memory for metadata, and potentially DOS-ing the system ... Kernels that manage memory as a cache of user-level content only partially address this problem. While caching-based approaches remove the opportunity for DOS attacks based on memory exhaustion, they do not enable the strict isolation of kernel memory that is a prerequisite for performance isolation or real-time systems and likely intro duce side channels.

Liedtke et al. [1997b] examined this issue and proposed per-process kernel heaps... Per-process kernel heaps simplify user level, by removing control of allocation, at the expense of foregoing the ability to revoke allocations without destroying the process, and the ability to reason directly about allocated memory, as opposed to just bounding it. The tradeoff is still being explored in the community

We take a substantially different approach with seL4; its model for managing kernel memory is seL4’s main contribution to OS design. Motivated by the desire to reason about resource usage and isolation, we subject all kernel memory to authority conveyed by capabilities. The only exception is the fixed amount of memory used by the kernel to boot up, including its strictly bounded stack. Specifically, we completely remove the kernel heap and provide userland with a mechanism to identify authorised kernel memory whenever the kernel allocates data structures. A side effect is that this reduces the size and complexity of the kernel, a major bonus to verification.

The key is making all kernel objects explicit and subject to capability-based access control. This approach is inspired by hardware-based capability systems, specifically CAP [Needham and Walker 1977], where hardware-interpreted capabilities directly refer to memory. HiStar [Zeldovich et al. 2011] also makes all kernel objects explicit, though it takes a caching approach to memory management.

Of course, user-visible kernel objects do not mean that someone with authority over a kernel object can directly read or write it. The capability provides the right to invoke (a subset of) object-specific methods, which includes destruction of the object. (Objects, once created, never change their size.) Crucially, the kernel object types include unused memory, called Untyped in seL4, which can be used to create other objects.

Specifically, the only operation possible on Untyped is to retype part of it into some object type. The relationship of the new object to the original Untyped is recorded in a capability derivation tree , which also records other kinds of capability derivation, such as the creation of capability copies with reduced privileges. Once some Untyped has been retyped, the only operation possible on the (corresponding part of) the original Untyped is to revoke the derived object (see below).

Retyping is the only way to create objects in seL4. Hence, by limiting access to Untyped memory, a system can control resource allocation. Retyping can also produce smaller Untyped objects, which can then be independently managed—this is key to delegating resource management. The derivation from Untyped also ensures the kernel integrity property that no two typed objects overlap. ...

There is one long-running operation, capability revocation, which requires preemption points. These ensure that the kernel is in a consistent state, that it has made progress, and to check for pending interrupts. If there are pending interrupts, then the kernel returns to usermode to restart the system call, which ensures that the interrupt gets handled first. The restarted system call continues the tear-down operation where the previous attempt was discontinued.

...

Objects are revoked by invoking the revoke() method on an Untyped object further up the tree; this will remove all capabilities to all objects derived from that Untyped . When the last capability to an object is removed, the object itself is deleted. This removes any in-kernel dependencies it may have with other objects, thus making it available for reuse. Removal of the last capability is easy to detect, as it cleans up the last leaf node in the capability tree referring to a particular memory location. " [29]
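
A hedged sketch of retyping in the libsel4 C API; untyped_cap, root_cnode, and the destination slots are placeholders that a real task would have received from its parent's bookkeeping:

    #include <sel4/sel4.h>

    /* Illustrative only: the untyped capability, the destination CNode, and the
       free slots are assumed to have been handed to this task by its parent. */
    int make_endpoint_and_tcb(seL4_CPtr untyped_cap, seL4_CNode root_cnode, seL4_Word dest_slot)
    {
        /* Carve an endpoint object out of the untyped region... */
        seL4_Error err = seL4_Untyped_Retype(untyped_cap,
                                             seL4_EndpointObject, 0,   /* type, size_bits (fixed-size object) */
                                             root_cnode, 0, 0,         /* destination CNode, index, depth */
                                             dest_slot, 1);            /* first slot, number of objects */
        if (err != seL4_NoError)
            return -1;

        /* ...and a TCB next to it. Both derivations are recorded in the capability
           derivation tree, so revoking untyped_cap later destroys both objects. */
        err = seL4_Untyped_Retype(untyped_cap,
                                  seL4_TCBObject, 0,
                                  root_cnode, 0, 0,
                                  dest_slot + 1, 1);
        return err == seL4_NoError ? 0 : -1;
    }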

" Table III gives the complete set of seL4 object types and their use for 32-bit ARM pro- cessors, x86 is very similar. Userland can only directly access (load/store/fetch) memory corresponding to a Frame that is mapped in its address space by inserting the Frame capability into a Page Table.

Table III. seL4 Kernel Objects for 32-bit ARM Processors

" [30]

L4 process hierarchy (dropped)

"L4 does not have a first-class process concept, it is a higher- level abstraction that consists of an address space, represented by a page table, and a number of associated threads. These consume kernel resources, and unchecked alloca- tion of TCBs and page tables could easily lead to denial of service. Original L4 dealt with that through a process hierarchy: “Task IDs” were essentially capabilities over address spaces, allowing creation and deletion. " [31]

seL4 toread for me

section 2.1 of http://ssrg.nicta.com.au/publications/nicta_full_text/1852.pdf

section 2.1 of http://ssrg.nicta.com.au/publications/nicta_full_text/7371.pdf

seL4 manual: http://sel4.systems/Info/Docs/seL4-manual-latest.pdf

seL4: A Security-Focused Microkernel By David Chisnall

L4 links

Papers:

Implementations:

---

https://genode.org/documentation/genode-foundations-17-05.pdf

" Capability-based security supposedly makes security easy to use by providing an intuitive way to manage authority without the need for an all-encompassing and complex global system policy. Kernelization of software components aids the deconstruction of complex software into low-complexity security-sensitive parts and high-complexity parts. The lat- ter no longer need to be considered as part of the trusted computing base. Virtualization can bridge the gap between applications that expect current-generation OSes and a new operating-system design. The management of budgets within hierarchical organizations shows how limited resources can be utilized and still be properly accounted for. None of those techniques is new by any means. However, they have never been used as a composition of a general-purpose operating system. This is where Genode comes into the picture. Application-specific trusted computing base A Genode system is structured as a tree of components where each component (except for the root of the tree) is owned by its parent. The notion of ownership means both responsibility and control. Being responsible for its children, the parent has to explicitly provide the resources needed by its children out of its own resources. It is also responsible to acquaint children with one another and the outside world. In return, the parent retains ultimate control over each of its children. As the owner of a child, it has ultimate power over the child’s envi- ronment, the child’s view of the system, and the lifetime of the child. Each child can, in turn, have children, which yields a recursive system structure. Figure 1 illustrates the idea. At the root of the tree, there is a low-complexity microkernel that is always part of the TCB. The kernel is solely responsible to provide protection domains, threads of execution, and the controlled communication between protection domains. All other system functions such as device drivers, network stacks, file systems, runtime environ- ments, virtual machines, security functions, and resource multiplexers are realized as components within the tree "

" An RPC object provides a remote-procedure call (RPC) interface. Similar to a regular object, an RPC object can be constructed and accessed from within the same program. But in contrast to a regular object, it can also be called from the outside of the compo- nent. What a pointer is to a regular object, a capability is to an RPC object. It is a token that unambiguously refers to an RPC object. In the following, we represent an RPC object as follows ... However, there are two important differences between a capability and a pointer. First, in contrast to a pointer that can be created out of thin air (e. g., by casting an arbitrary number to a pointer), a capability cannot be created without an RPC object. At the creation time of an RPC object, Genode creates a so-called object identity that represents the RPC object in the kernel ... For each protection domain, the kernel maintains a so-called capability space, which is a name space that is local to the protection domain. At the creation time of an RPC object, the kernel creates a corresponding object identity and lets a slot in the protection domain’s capability space refer to the RPC object’s identity. From the component’s point of view, the RPC object A has the name 3. When interacting with the kernel, the component can use this number to refer to the RPC object A. ... Whenever the kernel delegates a capability from one to another protection domain, it inserts a reference to the RPC object’s identity into a free slot of the target’s capability space. Within protection domain 2 shown in Figure 3, the RPC object can be referred to by the number 5. Within protection domain 3, the same RPC object is known as 2. Note that the capability delegation does not hand over the ownership of the object identity to the target protection domain. The ownership is always retained by the protection domain that created the RPC object. "

---

https://fuchsia.googlesource.com/zircon

Zircon

Zircon is the core platform that powers the Fuchsia OS. Zircon is composed of a microkernel (source in kernel/...) as well as a small set of userspace services, drivers, and libraries (source in system/...) necessary for the system to boot, talk to hardware, load userspace processes and run them, etc. Fuchsia builds a much larger OS on top of this foundation.

The canonical Zircon Git repository is located at: https://fuchsia.googlesource.com/zircon

A read-only mirror of the code is present at: https://github.com/fuchsia-mirror/zircon

The Zircon Kernel provides syscalls to manage processes, threads, virtual memory, inter-process communication, waiting on object state changes, and locking (via futexes).

Currently there are some temporary syscalls that have been used for early bringup work, which will be going away in the future as the long term syscall API/ABI surface is finalized. The expectation is that there will be about 100 syscalls.

Zircon syscalls are generally non-blocking. The wait_one, wait_many, and port_wait calls, and thread sleep, are the notable exceptions.
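
A hedged C sketch of that pattern (approximate signatures, from memory of the Zircon API; check the syscall docs): most calls return immediately, and blocking happens only at an explicit wait such as zx_object_wait_one:

    #include <stdint.h>
    #include <string.h>
    #include <zircon/syscalls.h>

    /* Illustrative only; error handling trimmed and signatures approximate. */
    int channel_demo(void) {
        zx_handle_t a, b;
        if (zx_channel_create(0, &a, &b) != ZX_OK)
            return -1;

        /* non-blocking: queue a small message on one end */
        const char msg[] = "ping";
        zx_channel_write(a, 0, msg, sizeof msg, NULL, 0);

        /* the one blocking step: wait until the other end becomes readable */
        zx_signals_t observed;
        zx_object_wait_one(b, ZX_CHANNEL_READABLE, ZX_TIME_INFINITE, &observed);

        char buf[16];
        uint32_t nbytes = 0, nhandles = 0;
        zx_channel_read(b, 0, buf, NULL, sizeof buf, 0, &nbytes, &nhandles);

        zx_handle_close(a);
        zx_handle_close(b);
        return (int)nbytes;   /* bytes received */
    }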

---

akavel 1 day ago [-]

Personally, in my heart, I'm kinda keeping fingers crossed for a future where if one needs to/has to, one uses Rust (mostly where one would use C/C++ today, including for "squeezing the last drop of performance", and especially high risk everpresent libraries like libpng; plus as an extra, probably for highly ambitious parallelism like in Servo); otherwise, one just goes with Go (for the amazing general productivity and readability).

Though then there's also Luna-lang, which I'm hoping even stronger to become long-term a new breakthrough paradigm/tech in programming & scripting...

---

" When reading this I think of my deep knowledge of TCP (especially the pain points) and yet it is "The Internet" for the past while. Compared to simpler protocols like Delta-T and later/better protocols like XTP, TCP still "won" for a variety of reasons (adoption, good enough, numbers of implementation, hardware implementation, etc) "

---

" The IO subsystem is architected very differently in Windows than it is in Linux, with different goals in mind. Some of the major differences we have to deal with are:

    Linux has a top-level directory entry cache that means that certain queries (most notably stat calls) can be serviced without calling into the file system at all once an item is in the cache. Windows has no such cache, and leaves much more up to the file systems. A Win32 path like C:\dir\file gets translated to an NT path like \??\C:\dir\file, where \??\C: is a symlink in Object Manager to a device object like \Device\HarddiskVolume4. Once such a device object is encountered, the entire remainder of the path is just passed to the file system, which is very different to the centralized path parsing that VFS does in Linux.
    Windows's IO stack is extensible, allowing filter drivers to attach to volumes and intercept IO requests before the file system sees them. This is used for numerous things, including virus scanning, compression, encryption, file virtualization, things like OneDrive's files on demand feature, gathering pre-fetching data to speed up app startup, and much more. Even a clean install of Windows will have a number of filters present, particularly on the system volume (so if you have a D: drive or partition, I recommend using that instead, since it likely has fewer filters attached). Filters are involved in many IO operations, most notably creating/opening files.
    The NT file system API is designed around handles, not paths. Almost any operation requires opening the file first, which can be expensive. Even things that on the Win32 level seem to be a single call (e.g. DeleteFile) actually open and close the file under the hood. One of our biggest performance optimizations for DrvFs which we did several releases ago was the introduction of a new API that allows us to query file information without having to open it first.

Whether we like it or not (and we don't), file operations in Windows are more expensive than in Linux, even more so for those operations that only touch file metadata (such as stat). "

herf 6 hours ago [-]

I spent many years optimizing "stat-like" APIs for Picasa - Windows just feels very different than Linux once you're benchmarking.

It turns out Windows/SMB is very good at "give me all metadata over the wire for a directory" and not so fast at single file stat performance. On a high-latency network (e.g. Wi-Fi) the Windows approach is faster, but on a local disk (e.g., compiling code), Linux stat is faster.


---

" Microkernels are closely related to exokernels.[11] " -- https://en.wikipedia.org/wiki/Microkernel

11 is Liedtke, Jochen (September 1996). "Towards Real Microkernels". Communications of the ACM. 39 (9): 70–77. doi:10.1145/234215.234473, and the quote from 11 that mentions exokernel: " Radical New Designs. A new radical approach, designing a microkernel architecture from scratch, seemed promising and necessary. Exokernel [9] and L4 [16], discussed here, both concentrate on a minimal and clean new architecture and support highly extensible operating systems.

• Exokernel. Exokernel, developed at MIT in 1994-95, is a small, hardware-dependent microkernel based on the idea that abstractions are costly and restrict flexibility [9]. The microkernel should multiplex hardware primitives in a secure way. The current exokernel, which is tailored to the Mips architecture and gets excellent performance for kernel primitives, is based on the philosophy that a kernel should provide no abstractions but only a minimal set of primitives (although the Exokernel includes device drivers). Consequently, the Exokernel interface is architecture dependent, dedicated to software-controlled translation lookaside buffers (TLBs). The basic communication primitive is the protected control transfer that crosses address spaces but does not transfer arguments. A lightweight RPC based on this primitive takes 10 μs on a Mips R3000 processor, while a Mach RPC needs 95 μs. Unanswered is the question of whether the right abstractions perform better and lead to better-structured and more efficient applications than Exokernel’s primitives do.

• L4. Developed at GMD in 1995, L4 is based on the theses that efficiency and flexibility require a minimal set of general microkernel abstractions and that microkernels are processor dependent. In [16], we show that even such compatible processors as the 486 and the Pentium need different microkernel implementations (with the same API)—not only different coding but different algorithms and data structures. Like optimizing code generators, microkernels are inherently not portable, although they improve the portability of a whole system. L4 supplies three abstractions—address spaces (described in the next section), threads, and IPC—implements only seven system calls, and needs only 12 Kbytes of code. Across-address-space IPC on a 486-DX50 takes 5 μs for an 8-byte argument and 18 μs for 512 bytes. The corresponding Mach numbers are 115 μs (8 bytes) and 172 μs (512 bytes). With 2 x 5 μs, the basic L4-RPC is twice as fast as a conventional Unix system call. It remains unknown whether L4’s abstractions, despite being substantially more flexible than the abstractions of the first generation, are flexible and powerful enough for all types of operating systems.

Both approaches seem to overcome the performance problem. Exokernel’s and L4’s communication is up to 20 times faster than that of first-generation IPC "

" Operating systems generally present hardware resources to applications through high-level abstractions such as (virtual) file systems. The idea behind exokernels is to force as few abstractions as possible on application developers, enabling them to make as many decisions as possible about hardware abstractions.[2] Exokernels are tiny, since functionality is limited to ensuring protection and multiplexing of resources, which is considerably simpler than conventional microkernels' implementation of message passing and monolithic kernels' implementation of high-level abstractions. ... Design

The MIT exokernel manages hardware resources as follows:

Processor The kernel represents the processor resources as a timeline from which programs can allocate intervals of time. A program can yield the rest of its time slice to another designated program. The kernel notifies programs of processor events, such as interrupts, hardware exceptions, and the beginning or end of a time slice. If a program takes a long time to handle an event, the kernel will penalize it on subsequent time slice allocations; in extreme cases the kernel can abort the program.

Memory The kernel allocates physical memory pages to programs and controls the translation lookaside buffer. A program can share a page with another program by sending it a capability to access that page. The kernel ensures that programs access only pages for which they have a capability.

Disk storage The kernel identifies disk blocks to the application program by their physical block address, allowing the application to optimize data placement. When the program initializes its use of the disk, it provides the kernel with a function that the kernel can use to determine which blocks the program controls. The kernel uses this callback to verify that when it allocates a new block, the program claims only the block that was allocated in addition to those it already controlled.

Networking The kernel implements a programmable packet filter, which executes programs in a byte code language designed for easy security-checking by the kernel. " -- https://en.wikipedia.org/wiki/Exokernel

" Liedtke demonstrated with his own L4 microkernel that through careful design and implementation, and especially by following the minimality principle, IPC costs could be reduced by more than an order of magnitude compared to Mach. L4's IPC performance is still unbeaten across a range of architectures.[22][23][24] ... First-generation microkernels typically supported synchronous as well as asynchronous IPC, and suffered from poor IPC performance. Jochen Liedtke assumed the design and implementation of the IPC mechanisms to be the underlying reason for this poor performance. In his L4 microkernel he pioneered methods that lowered IPC costs by an order of magnitude.[13] These include an IPC system call that supports a send as well as a receive operation, making all IPC synchronous, and passing as much data as possible in registers. Furthermore, Liedtke introduced the concept of the direct process switch, where during an IPC execution an (incomplete) context switch is performed from the sender directly to the receiver. If, as in L4, part or all of the message is passed in registers, this transfers the in-register part of the message without any copying at all. Furthermore, the overhead of invoking the scheduler is avoided; this is especially beneficial in the common case where IPC is used in an RPC-type fashion by a client invoking a server. Another optimization, called lazy scheduling, avoids traversing scheduling queues during IPC by leaving threads that block during IPC in the ready queue. Once the scheduler is invoked, it moves such threads to the appropriate waiting queue. As in many cases a thread gets unblocked before the next scheduler invocation, this approach saves significant work. Similar approaches have since been adopted by QNX and MINIX 3.[citation needed]

In a series of experiments, Chen and Bershad compared memory cycles per instruction (MCPI) of monolithic Ultrix with those of microkernel Mach combined with a 4.3BSD Unix server running in user space. Their results explained Mach's poorer performance by higher MCPI and demonstrated that IPC alone is not responsible for much of the system overhead, suggesting that optimizations focused exclusively on IPC will have limited impact.[14] Liedtke later refined Chen and Bershad's results by making an observation that the bulk of the difference between Ultrix and Mach MCPI was caused by capacity cache-misses and concluding that drastically reducing the cache working set of a microkernel will solve the problem.[15]

In a client-server system, most communication is essentially synchronous, even if using asynchronous primitives, as the typical operation is a client invoking a server and then waiting for a reply. As it also lends itself to more efficient implementation, most microkernels generally followed L4's lead and only provided a synchronous IPC primitive. Asynchronous IPC could be implemented on top by using helper threads. However, experience has shown that the utility of synchronous IPC is dubious: synchronous IPC forces a multi-threaded design onto otherwise simple systems, with the resulting synchronization complexities. Moreover, an RPC-like server invocation sequentializes client and server, which should be avoided if they are running on separate cores. Versions of L4 deployed in commercial products have therefore found it necessary to add an asynchronous notification mechanism to better support asynchronous communication. This signal-like mechanism does not carry data and therefore does not require buffering by the kernel. By having two forms of IPC, they have nonetheless violated the principle of minimality. Other versions of L4 have switched to asynchronous IPC completely.[16]

As synchronous IPC blocks the first party until the other is ready, unrestricted use could easily lead to deadlocks. Furthermore, a client could easily mount a denial-of-service attack on a server by sending a request and never attempting to receive the reply. Therefore, synchronous IPC must provide a means to prevent indefinite blocking. Many microkernels provide timeouts on IPC calls, which limit the blocking time. In practice, choosing sensible timeout values is difficult, and systems almost inevitably use infinite timeouts for clients and zero timeouts for servers. As a consequence, the trend is towards not providing arbitrary timeouts, but only a flag which indicates that the IPC should fail immediately if the partner is not ready. This approach effectively provides a choice of the two timeout values of zero and infinity. Recent versions of L4 and MINIX have gone down this path (older versions of L4 used timeouts). QNX avoids the problem by requiring the client to specify the reply buffer as part of the message send call. When the server replies the kernel copies the data to the client's buffer, without having to wait for the client to receive the response explicitly.[17] ... Start up (booting) of a microkernel-based system requires device drivers, which are not part of the kernel. Typically this means that they are packaged with the kernel in the boot image, and the kernel supports a bootstrap protocol that defines how the drivers are located and started; this is the traditional bootstrap procedure of L4 microkernels. Some microkernels simplify this by placing some key drivers inside the kernel (in violation of the minimality principle), LynxOS? and the original Minix are examples. Some even include a file system in the kernel to simplify booting. A microkernel-based system may boot via multiboot compatible boot loader. Such systems usually load statically-linked servers to make an initial bootstrap or mount an OS image to continue bootstrapping. ... For efficiency, most microkernels contain schedulers and manage timers, in violation of the minimality principle and the principle of policy-mechanism separation. ... This has led to what is referred to as third-generation microkernels,[32] characterised by a security-oriented API with resource access controlled by capabilities, virtualization as a first-class concern, novel approaches to kernel resource management,[33] and a design goal of suitability for formal analysis, besides the usual goal of high performance. Examples are Coyotos, seL4, Nova,[34][35] Redox and Fiasco.OC.[34][36] ...

" -- https://en.wikipedia.org/wiki/Microkernel

---

some suggested changes to Plan 9:

https://aiju.de/plan_9/9p2020

some selected parts:

"

Rejected ideas

    Tmove: allowing unprivileged (whatever that means) users to move directories around might break program assumptions, leading to all sorts of crazy exploits. probably not worth it. (also the classic argument: why put it in 9p if it doesn't work with most file servers?)

General / misc changes

    Rerror contains an optional errno field that clients can use to check against common errors (e.g. EPERM)
    alternative: error strings are standardised in the spec, possibly by defining prefixes such as "Eperm:"
    Servers are required to reply to undefined messages with Rerror

Open/create changes

    Tcreate without OEXCL functions like Topen if the file exists

Stat changes

    atime and mtime are now 64 bits and in nanoseconds.
    mode DMSANE / qid.type QTSANE: the file server promises that Tread respects offset (unless DMDIR is set) and has no side effects.
    mode DMCACHE / qid.type QTCACHE: the file server honors qid version (the file contents are guaranteed to be unchanged if qid version is unchanged)
    If some flag (explicit byte in Twstat? funny character in name?) is set, changing the name with Twstat will delete an existing file with the new name, if possible (atomic replace).
    If the new name starts with a slash, the file is moved rather than renamed. Support for this is optional. see Tmove
    Add stat and error fields to Rwstat; if a server can only do a partial change (which it is strongly recommended they avoid, for atomicity reasons), it can signal this by returning a stat field with all successfully changed fields set to -1 and return an error [this change is mostly to support fileservers that translate to another protocol that doesn't make the strong guarantees on attribute changes that 9p2000 makes]

Walk changes

    remove 16 element limit?
    Rwalk contains an error field
    For use with chaining/grouping: A new field badqid[s] is added. It encodes a bitmap of variable length n bits (which has to be a multiple of 8). If the hash of the qid (hash function tbc) modulo n has the corresponding bit set, walks will stop at that qid and fail.

Read changes

    EOF flag in Rread (only used with DMSANE): another read right now would return 0
    Reading a directory with offset –1 overrides the "offset equals last offset plus count" check [maybe get rid of it entirely?]
    Reading a file with offset –1 is equivalent to reading with the last offset + last returned count

Write changes

    Rwrite contains an error field

...

9P grouping

(Alternative to chaining)

All messages contain two extra fields: grpbits[1] grptag[2]. Servers guarantee not to reorder messages with identical grptag. If grpbits contains GRPKILL then a failed or flushed message will mark the grptag as killed. If grpbits contains GRPMORTAL then the message is aborted if grptag is killed by a previous message.

NOTAG refers to no group (or rather, each use of it refers to a separate group), i.e. different messages with NOTAG can be reordered freely and grpbits has no effect.

A new message Tclunkgrp[2] resets the kill tag. Servers can assume that the group tag refers to a new group after this request (i.e. the reordering dependency chain is broken). "
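A rough sketch of how the proposed grpbits/grptag fields might look in code; this is purely illustrative of the proposal above (field widths follow the grpbits[1] grptag[2] notation), not an implemented protocol, and the helper function is hypothetical.

    #include <stdint.h>

    /* Hypothetical wire-header extension for the "9P grouping" proposal:
     * every T-message would carry these two extra fields. */
    enum {
        GRPKILL   = 1 << 0,  /* a failed/flushed message marks grptag as killed  */
        GRPMORTAL = 1 << 1,  /* abort this message if grptag was killed earlier  */
    };

    struct grouphdr {
        uint8_t  grpbits;    /* grpbits[1] */
        uint16_t grptag;     /* grptag[2]; NOTAG means "no group" */
    };

    /* Server-side bookkeeping (sketch): remember which group tags are killed. */
    struct group {
        uint16_t tag;
        int      killed;
    };

    /* Before processing a request, a server would check the group state. */
    static int group_allows(struct group *g, struct grouphdr h)
    {
        if (h.grptag == g->tag && (h.grpbits & GRPMORTAL) && g->killed)
            return 0;   /* reply Rerror: aborted by earlier failure in the group */
        return 1;
    }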

---

https://docs.huihoo.com/plan9/Plan9.pdf

" UnixWhat? was different: Small, fast, lean andsimple.In other words: KISSUnix distilled and polished many of the Multics ideas, and added some of its own. Hierarchical file system. Everything is a file. Each tool does one thing and does it well. Processes communicate and interact through pipes. Text is the universal language. Written in a high-level language

...

What Plan 9 doesn’t have: root, suid, tty, curses, ioctl, sockets, select/poll, symlinks, pthreads, mmap, dynamic linking, loadable kernel modules, locales, gcc, C++, emacs (or vi), X11, XML, WEB 2.0

...

Main ideas. Three major concepts:

    Everything is a file tree (file system): All resources are named and accessed like files in a hierarchical file system. No files are more special than others.
    9P: A single protocol to securely and transparently access resources independently of their location.
    Private namespaces: Every process can customize its view of the world by changing its private namespace that joins together the file hierarchies provided by different servers representing the network resources.

...

All resources as file systems:

    Networking: /net/
    Authentication: /mnt/factotum/
    Encryption: /net/tls/
    Graphics: /dev/draw/
    Window system: /dev/wsys/
    Process control and debugging: /proc/
    Environment: /env/
    Email: /mail/fs/
    Boring stuff: cdfs, webfs, tarfs, ftpfs, etc...
    Weird stuff: wikifs, gpsfs, sshnet, iostats and others.

...

The 9P protocol: Lightweight network file system (but in Plan 9, everything is a file.) Not block oriented, byte oriented. Minimalistic and lightweight: only 13 message types (NFSv3 has 22, NFSv4 has 42). Can run over any reliable transport (directly over TCP, IL, RUDP, PPP, shared memory, serial port connections, PCI bus and others). Encompassing: used for local and remote access; for synthetic and physical files. Authentication and encryption agnostic. Designed for transparent caching and stacking.

Namespaces

Imagine using a programming language where all variables were global......that is how Plan 9 users feel when they have to use Unix.

Three namespace operations:

    int bind(char *name, char *old, int flag)
    int mount(int fd, int afd, char *old, int flag, char *aname)
    int unmount(char *name, char *old)

Example: /sys/src/libdraw/newwindow.c. From the shell:

    % mount `{cat /env/wsys} /n/foo new
    % cat /proc/$pid/ns

Union mounts

Flags for bind and mount:

    MREPL: Replace the old file by the new one. Henceforth, an evaluation of old will be translated to the new file.
    MBEFORE: Both the old and new files must be directories. Add the files of the new directory to the union directory at old so its contents appear first in the union.
    MAFTER: Like MBEFORE but the new directory goes at the end of the union.
    MCREATE: OR'd with any of the above. When create attempts to create a new file in a union directory, it will create the file in the first mount with the MCREATE flag.

Example:

    % ns | grep /bin
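As a concrete (if simplified) example of the union-mount flags, this is roughly how a Plan 9 C program could put a private bin directory in front of /bin for its own namespace; the path /usr/glenda/bin is just an illustrative assumption.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char *argv[] = {"rc", nil};

        /* Put /usr/glenda/bin ahead of the existing /bin in the union,
         * and direct new file creation there (MCREATE).  Only this process
         * (and children sharing its namespace) sees the change. */
        if(bind("/usr/glenda/bin", "/bin", MBEFORE|MCREATE) < 0)
            sysfatal("bind: %r");
        exec("/bin/rc", argv);
        sysfatal("exec: %r");
    }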

The kernel

Linux: Over 300 syscalls and counting; hundreds (thousands?) of ioctls. 4,827,671 lines of code. Others are not better... Unix is not small anymore.

Plan 9: 37 syscalls; no networking related syscalls; no ioctls. 148,787 lines of code. Optional real-time scheduler. Microkernel or monolithic kernel? Who cares? Inside the kernel or outside the kernel? Do what makes most sense; it can be changed later without breaking the interface. Pragmatic, not dogmatic design.

...

ken’s C

Very simplified preprocessor. Support for UTF-8 literals. Some small and convenient extensions

...

Opening a network connection in Unix

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netdb.h>
    ...
    struct sockaddr_in sock_in;
    struct servent *sp;
    struct hostent *host;
    ...
    memset(&sock_in, 0, sizeof (sock_in));
    sock_in.sin_family = AF_INET;
    f = socket(AF_INET, SOCK_STREAM, 0);
    if (f < 0)
        error("socket");
    if (bind(f, (struct sockaddr*)&sock_in, sizeof sock_in) < 0)
        error("bind");
    host = gethostbyname(argv[1]);
    if(host){
        sock_in.sin_family = host->h_addrtype;
        memmove(&sock_in.sin_addr, host->h_addr, host->h_length);
        ...
    }else{
        sock_in.sin_family = AF_INET;
        sock_in.sin_addr.s_addr = inet_addr(argv[1]);
        if (sock_in.sin_addr.s_addr == -1)
            error("unknown host %s", argv[1]);
    }
    sp = getservbyname("discard", "tcp");
    if (sp)
        sock_in.sin_port = sp->s_port;
    else
        sock_in.sin_port = htons(9);
    if (connect(f, (struct sockaddr*)&sock_in, sizeof sock_in) < 0)
        error("connect:");

Tedious and clunky. Protocol-specific details leak through the API. C-specific, hard to access from elsewhere without ugly wrapping.

Opening a network connection in Plan 9

    #include <u.h>
    #include <libc.h>
    ...
    fd = dial(netmkaddr(argv[1], "tcp", "discard"), 0, 0, 0);
    if(fd < 0)
        sysfatal("can't dial %s: %r", argv[1]);

Clean and simple. Abstracts the protocol details from the application.

Addresses under /net: /net/cs /net/dns /net/tcp/ /net/udp/ /net/etherX/

No need for NAT or VPN, just import /net from the gateway or remote host, and all applications will use it transparently. Can mount multiple local IP stacks concurrently. Each stack is physically independent of all others. Normally a system uses only one stack. Multiple stacks can be used for debugging, implementing firewalls or proxy services, e.g., ipmux.

Fossil and Venti: Venti provides block storage indexed by hash. Duplicated blocks are stored only once. Fossil: permanent storage of files with automatic snapshots, using Venti for storage. Examples: To see what changes have been made to the user database since the system was set up:

    % history -D /adm/users

To compile a kernel with the version of the Intel PRO/1000 ethernet driver from 3 days ago:

    % cd /sys/src/9/pc
    % yesterday -b -n 3 ./etherigbe.c

Security, factotum and secstore: No root and no suid. When everyone has the same privileges (none), escalation is impossible. Namespaces provide isolation, much cleaner and more secure than chroot. Authentication and encryption are implemented in a single place, through file systems! Factotum handles all authentication tasks, transparently for the application. Manages keys, keeps private keys secret from applications! Can store all keys securely in a secstore encrypted with a master key (often in removable media).

Concurrency with Communicating Sequential Processes: Introduced by Hoare's 1978 paper, based on Dijkstra's work. Book online at: http://www.usingcsp.com/ Procs: preemptively scheduled by the operating system, shared address space. Threads: cooperatively scheduled coroutines within a proc. (Not pthreads! Think coroutines.) Channels: all communication is done through synchronous typed channels. Avoids locking and simplifies the design of concurrent systems. Implementations: Alef: compiled language, for systems. Limbo: portable, bytecode-based. The main language used in Inferno. Libthread: translate Alef programs to C. Lose syntactic sugar. Others: Occam, Concurrent ML, Moby, Erlang, Stackless Python.

Inferno/Limbo: Based on the ideas researched in Plan 9 but taking a more radical approach. Limbo: new garbage-collected concurrent language while keeping the C philosophy. Dis: virtual machine designed for portability and JIT. Can run directly on hardware or hosted on almost any operating system (Plan 9, Linux, *BSD, Solaris, Irix, Windows, MacOS? and others.) Extremely portable, doesn't require an MMU! Processes are extremely cheap. Inferno and Plan 9 complement each other, for example: Can mount /net in Plan 9 from Inferno instances running on any OS. Can debug embedded Inferno systems remotely from Plan 9. Can write cross-platform applications in Limbo. Allows integration of non-Plan 9 systems in a Plan 9 network, Inferno Grid. Released last year under a GPL/MIT dual license.

Resources and references:

    http://9fans.net - The main Plan 9 website.
    http://plan9.bell-labs.com/sys/doc/ - The main collection of papers.
    http://plan9.bell-labs.com/wiki/plan9/Papers/ - Other papers and publications.
    http://www.vitanuova.com/inferno/net_download4T.html - Inferno.
    http://plan9.us - Plan 9 from User Space for Unix systems.
    http://www.usingcsp.com/ - Communicating Sequential Processes by Tony Hoare

" -- https://docs.huihoo.com/plan9/Plan9.pdf

---

https://wiki.osdev.org/Microkernel A Microkernel tries to run most services - like networking, filesystem, etc. - as daemons / servers in user space. All that's left to do for the kernel are basic services, like memory allocation (however, the actual memory manager is implemented in userspace), scheduling, and messaging (Inter Process Communication).

In theory, this concept makes the kernel more responsive (since much functionality resides in preemptible user-space threads and processes, removing the need for context-switching into the kernel proper), and improves the stability of the kernel by reducing the amount of code running in kernel space. There are also additional benefits for OSes that support multi-CPU computers (much simpler re-entrancy protection and better suitability for asynchronous functionality) and distributed OSes (code can use services without knowing if the service provider is running on the same computer or not). A drawback is the amount of messaging and Context Switching involved, which makes microkernels conceptually slower than monolithic kernels. (This isn't to say that a smartly designed microkernel could not beat a foolishly designed monolithic one.)

In practice things can be quite different. For example, if the filesystem crashed, the kernel would continue running, but the user would still lose some data - unless provisions were made to restart the filesystem server / daemon, and a data recovery system exists. Microkernels can be more stable, but it requires additional design work; it does not come free with the architecture ... AmigaOS?, for example, was a microkernel - and an unusual one: Since the original AmigaOS? had no memory protection, its messaging was as quick as it could get (passing a pointer to memory), making the AmigaOS? kernel one of the fastest ever devised. On the other hand, that lack of memory protection also meant that the microkernel architecture gave no added stability (later versions did implement MMU support, but at the same speed cost that affects other microkernel systems). ... Examples

    Mach
    QNX
    L4
    AmigaOS
    Minix 

See Also Forum Threads

    microkernels
    discussing microkernel vs modular macrokernel 

Recommended Reading

    The Increasing Irrelevance of IPC performance in Microkernel-based Operating Systems by Brian N. Bershad
    The Persistent Relevance of IPC performance in Microkernel-based Operating Systems by Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl
    µ-Kernels Must And Can Be Small by Jochen Liedtke
    Microkernel-based OS Efforts by Christopher Browne
    Microkernel Construction Notes by Raphael Neider

https://wiki.osdev.org/Nanokernel#Nanokernel.2FPicokernel

 Nanokernel/Picokernel

Nanokernels and picokernels are usually small kernels considered by their creators to be even smaller than microkernels. Examples include: Adeos, KeyKOS?, and LSE/OS. Another very famous example is the Symbian EKA2 kernel. This nanokernel implements drivers inside the kernel, making it not fully a microkernel.

http://home.gna.org/adeos/ https://www.cis.upenn.edu/~KeyKOS/NanoKernel/NanoKernel.html http://lseos.sourceforge.net/

---

keykos microkernel:

http://cap-lore.com/CapTheory/KK/Arch/

capability-based OS

http://cap-lore.com/CapTheory/KK/

---

http://lseos.sourceforge.net/

lse os microkernel

http://lseos.sourceforge.net/operating_system.htm http://lseos.sourceforge.net/coresrv.htm http://lseos.sourceforge.net/libc.htm http://lseos.sourceforge.net/architecture.htm


Symbian EKA2 microkernel

https://media.wiley.com/product_data/excerpt/47/04700252/0470025247.pdf https://web.archive.org/web/20120616044816/http://www.developer.nokia.com/Community/Wiki/Symbian_OS_Internals

---

A great lecture (notes) introducting L4:

http://web.archive.org/web/20140803112320/http://i30www.ira.uka.de/~neider/edu/mkc/mkc.html#id2477422

" Basic Abstractions and Mechanisms

The L4 microkernel, which is the basis for this lecture, was designed from scratch (rather than stripping existing systems down) and implemented to provide fast IPC. It offers two abstractions: Threads to abstract from the CPU, and address spaces to protect threads against each other. In order to be able to manipulate these abstractions, the L4 kernel also provides two mechanisms: inter-process communication (IPC) allows threads to communicate and to synchronize, whereas mapping allows to recursively populate address spaces.

The kernel maps all hardware properties to these abstractions: All activities (including I/O devices) are represented in threads, interrupts are transformed into an IPC from the device thread to the driver thread. Pagefaults are wrapped into a pagefault IPC from the faulting thread to another thread that is responsible for providing memory (a so-called pager), and (physical) memory is given to an address space by mapping memory from the pager (the pager itself requests the memory from its respective pager, the root pager idempotently maps physical memory to its clients on demand).
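A sketch of the pager idea: page faults arrive at a user-level pager as IPC messages, and the pager's reply carries a mapping back to the faulting thread. The API names (ipc_wait, ipc_reply_wait, alloc_frame) and message layout are invented for illustration and do not match any specific L4 version.

    #include <stdint.h>

    /* Hypothetical user-level pager loop.  In L4, the faulting thread does not
     * see the fault at all: the kernel synthesizes a "pagefault IPC" to the
     * thread's pager, and the pager's reply carries a mapping (a typed item). */

    typedef struct { uintptr_t fault_addr; int write; unsigned from; } pf_msg;
    typedef struct { uintptr_t base; unsigned rights; } map_item;

    extern pf_msg    ipc_wait(void);                          /* open receive        */
    extern pf_msg    ipc_reply_wait(unsigned to, map_item m); /* reply, then receive */
    extern uintptr_t alloc_frame(void);                       /* ask our own pager   */

    void pager_loop(void)
    {
        pf_msg pf = ipc_wait();
        for (;;) {
            map_item m = { alloc_frame(), pf.write ? 3 /*rw*/ : 1 /*r*/ };
            /* The reply maps the frame into the faulter's address space and
             * implicitly restarts it; then we wait for the next fault. */
            pf = ipc_reply_wait(pf.from, m);
        }
    }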

...

Message Content

The message itself is stored in 64 word-wide virtual message registers (MR0 .. MR63), each of which is realized either in a physical processor register or in a thread-local region of memory (see the section called "Virtual Registers and the UTCB").

The message itself consists of a message tag in MR0, followed by 0..63 untyped words (which are simply copied during an IPC), followed by 0..63 typed words. Typed words (or typed items) are interpreted by the kernel and serve to encode strings and mappings.
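A rough C view of that layout, assuming a 32-bit word; the exact bit-field split of the message tag is my recollection of the L4 X.2 convention and should be treated as an assumption, not a spec quote.

    #include <stdint.h>

    /* Sketch of the virtual message registers and the message tag in MR0. */
    enum { NUM_MRS = 64 };

    typedef union {
        uint32_t raw;
        struct {
            uint32_t u     : 6;   /* number of untyped words (0..63)     */
            uint32_t t     : 6;   /* number of typed words   (0..63)     */
            uint32_t flags : 4;
            uint32_t label : 16;  /* protocol/operation chosen by the user */
        };
    } msgtag_t;

    typedef struct {
        msgtag_t tag;                 /* MR0                              */
        uint32_t mr[NUM_MRS - 1];     /* MR1..MR63: untyped, then typed   */
    } msg_t;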

...

Besides the state, also thread-local management data is stored in the TCB:

    the global thread ID (due to the way TCBs are mapped to threads, see the section called "Locating TCBs")
    the local thread ID (in order to access the UTCB from inside the kernel, see the section called "Virtual Registers and the UTCB")
    resources (such as the floating point unit or the temporary mapping area, see the section called "Temporary Mapping Area") currently used by the thread as these are relevant during scheduling/thread switching
    a thread state designator (ready, waiting, locked running, ...)
    various scheduling parameters such as priority, total and remaining timeslice length, total quantum (maximum total CPU time allowed for the thread, currently unused)
    doubly linked list pointers linking all threads in the same state (ready list per priority, waiting list, ...)
    a pointer to the list of timeouts for the thread
    a pointer to the structure that describes the thread's address space (e.g., a pointer to the top level page directory)
    a reference to the hardware address space identifier (physical address of the top level page directory (%cr3) for Intel or ASID/TLB tag for architectures with tagged TLBs)

...

Communication Primitives

For message-based communication, at least two primitives are required:

    send to (specified thread)
    receive from (specified thread) 

These operations allow for two threads to communicate with each other and prevent interference from other threads: The receiver only accepts messages from the intended sender, aka closed receive). Other threads must wait until the receiver accepts messages from them. However, servers that offer their services to all threads are difficult to implement using the closed receive primitive for two reasons:

    The server must know all potential client thread IDs.
    As the thread blocks during receive (assuming the client has not yet issued a request),
        the server either has to provide one receiver thread per client, or
        the server must poll all clients for unhandled requests. 

Avoiding these problems, the kernel should offer an additional primitive (known as an open receive):

    receive (from any thread) 

Looking at the (expected) common interaction patterns in a multi-server system, clients will typically send requests to servers and then wait for the answer (cf. RPC). If in such a scenario the client is preempted after having sent the request but before having issued the accompanying receive, the server might try to deliver the result (answer message) to the client before the client is receiving. As a consequence, the server would block and thus be prevented from servicing other requests. To solve this problem, two approaches can be considered:

    The server could check whether the client is ready to receive and store the answer message in some queue if it is not. As this would consume server resources, this approach would enable denial-of-service attacks against the server. Worse, the server cannot avoid having to store some answers, as even well-behaving, non-malicious clients could be preempted between their send request and receive response calls.
    The better solution provides another IPC primitive to let the clients express the send and receive phases atomically. Such a primitive would guarantee that the client is ready to receive as soon as the request message has been delivered to the server. As a consequence, the server could assume that well-behaving clients use the atomic send-and-receive primitive and discard answers if the client turns out not to be ready to receive it. 

Combined send-and-receive primitives could be:

    call (send to specific thread and receive from same)
    reply-and-wait (send to specific thread and receive from any)
    reply-and-wait for other (send to specific thread and wait for message from other specific thread) 

[Note]

call is the typical client-side primitive when requesting a service, whereas reply-and-wait would be the primitive most suitable for the server (send answer to client A and wait for requests from any client). The third combination has yet to prove its worth, but comes for free (see the section called "Operation and Addresses").

Message Types

Depending on the size of the messages to be transferred, we can distinguish three message types:

Registers
    Short messages can be transferred in (virtual) registers. Such message transfers avoid memory accesses wherever possible, thus avoiding cache and TLB misses as well as user-level pagefaults.
Strings
    Contiguous data can be copied from the sender address space to the receiver address space via string IPC. This is the most general form of messages, but also the slowest (the message has to be copied at least once).
Pages
    If large amounts of data are to be transferred, the virtual memory mechanism can be used to establish shared memory between the sender and the receiver. This allows data to be made available to multiple address spaces without ever copying it. Unlike the other two message types, this type offers call-by-reference semantics (rather than call-by-value), i.e., changes to the data in the receiver's address space are visible in the sender's address space as well. Should this not be desired, this form of communication should allow restricting the access rights for the receiver.

[Note]

Besides memory pages, other resources can be mapped as well (e.g., access to I/O ports on IA32).

The distinction becomes important when implementing the IPC primitives, as we shall discuss in Chapter 6, IPC Implementation.

...

Encoding of IPC Parameters

For every IPC operation, a number of parameters such as the intended receiver, the type and contents of the message and the desired operation (send, receive, call, ...), or the desired timeouts (if any) must be specified. Clever encoding of these can avoid superfluous memory accesses and thus help to make IPC fast.

Operation and Addresses

To avoid code duplication, L4 only implements a single IPC system call, which implements the atomic send-and-receive operation between threads. As a consequence, the system call requires two addresses:

destination thread ID
    The ID of the thread that is to receive the message that is sent by this invocation of the IPC syscall. If no send phase is desired, a special NIL thread ID (0) can be specified.
receive specifier

    If not NIL, this thread ID specifies the thread that is expected to send a message to the invoking (i.e., the current) thread. The receive phase always follows the send phase (if any): Typically, having received a message requires some sort of processing before a meaningful answer can be sent. An exception to this rule would be a simple "ACK", which is implicitly passed to the originator as the return code of the send operation: If the send returns without errors, the message has been delivered (see the section called "Synchronous Communication").
    If the receive specifier is NIL, no receive phase is initiated and the syscall returns right after the send phase (if any).
    For an open receive, which would allow the current thread to receive a message from any thread, another special thread ID can be used: the wildcard thread ID (-1).

This scheme allows to encode the desired operation (send to, receive from, receive from any, call, or reply and wait) and all relevant threads in only two words.

As both addresses must be inspected in all IPC calls, they should be passed to the kernel in registers rather than in memory.
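The two-address encoding can be made concrete with a tiny wrapper layer; the ipc() primitive and the thread-ID constants below are hypothetical stand-ins for the single L4 system call described above.

    #include <stdint.h>

    typedef uint32_t tid_t;

    enum {
        NILTHREAD = 0,            /* "no send phase" / "no receive phase" */
        ANYTHREAD = (tid_t)-1,    /* wildcard: open receive               */
    };

    /* The one (hypothetical) IPC syscall: an optional send phase to 'dest',
     * followed by an optional receive phase from 'from'. */
    extern int ipc(tid_t dest, tid_t from);

    /* All higher-level operations are just parameter choices: */
    static int send_to(tid_t t)        { return ipc(t, NILTHREAD); }
    static int recv_from(tid_t t)      { return ipc(NILTHREAD, t); }
    static int recv_any(void)          { return ipc(NILTHREAD, ANYTHREAD); }
    static int call(tid_t server)      { return ipc(server, server); }   /* send, then closed receive      */
    static int reply_and_wait(tid_t c) { return ipc(c, ANYTHREAD); }     /* reply to client, open receive  */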

...

Fast Path IPC

Short IPC, meaning IPC with its message solely contained in untyped message registers without strings or mappings, is (assumed to be) the most frequent kernel operation in microkernel-based systems. This justifies taking special care on its implementation to avoid most of the possible performance problems (cache misses, TLB misses, memory references in general, down to microarchitectural effects such as pipeline stalls and flushes or instruction scheduling; the latter shall not be discussed here in more detail).

In order to retain full control over these effects, L4 features two IPC paths:

    The fast path is implemented in hand-crafted assembler, but handles only the easy (hopefully common) case:
        short IPC (only untyped message words)
        call or reply-and-wait semantics, so that no scheduling decision is required
            the sender blocks after sending (waits for reply or next request)
            the receiver is waiting and dispatched after sending 
        no effective timeouts
            send timeout: irrelevant as the receiver must be waiting
            xfer timeouts: irrelevant due to the required absence of string items
            recv timeout: must be infinite (only effective timeout)

...

Scheduling in L4

Whenever L4 must choose the next thread to run, it consults its built in scheduler. The scheduler uses

fixed priorities (0..255) per thread
    Although the threads' priorities can be changed at runtime by the user, the kernel does not change them as, e.g., earliest-deadline-first-based schedulers might (255 signifying the highest priority).
hard priorities
    As the scheduler always chooses one of the ready threads with the highest priority, low-priority threads can starve.
round robin per priority

    Threads of the same priority are scheduled in round-robin fashion for a given timeslice; at the end of the timeslice, the thread is preempted and another thread is dispatched. For this purpose, the L4 scheduler maintains
        a doubly linked, circular list of all ready TCBs per priority level (list pointers are embedded in the TCBs to avoid allocation and deallocation overhead and to increase locality: the list node and the TCB share the same TLB entry), and
        an array of 256 pointers to the TCB of the currently running/next thread per priority level. 

In principle, the scheduler starts at the array entry for the highest priority and checks if there is a thread ready to run (array entry is not nil). If so, that thread is selected for dispatch; otherwise, the array is searched in order of decreasing priorities. If no thread is ready to run, a special idle thread is dispatched.

When the current thread blocks, yields, or exhausts its timeslice, the next thread in that priority class is dispatched and given a new timeslice. Additionally, the pointer to the currently running thread in the array is updated to point to the selected thread.

[Note]

Timeslice accounting is slightly inaccurate as it happens in the handler of the periodic timer interrupt by subtracting the length of the timer period from the remaining timeslice length. If the remaining timeslice becomes zero or negative, it is said to be exhausted, causing an immediate scheduling process. If the thread blocks in between two timer ticks, the next thread will start with only half a timer period till the next timer tick. Whether it is billed a whole timer period or whether this case is correctly accounted for is beyond the author's knowledge.

Optimizations

Usually, threads have priorities around 100 on L4. To avoid checking the priority levels 255..101 for runnable threads, L4 remembers the highest priority with a runnable thread and only checks at and below that level during scheduling.

This so-called "highest active priority" must be updated whenever a thread with a higher priority becomes ready to run, and should be updated whenever the currently highest active priority level becomes empty (because the last thread on that level blocks). Both events can easily be registered in the kernel.

...

Dispatching Mechanism

Finding the next thread to run can be quite time consuming (esp. if no bsr instruction can be used) and should thus be preemptible itself, in order to achieve a low interrupt latency. If the scheduler were to execute in the context of the current thread (the one that is to be replaced), this goal would be hard to achieve: On interrupts, the current state of the interrupted thread would be saved, the (in-kernel) interrupt handler would still execute on the stack of the current thread and possibly invoke the scheduler again. When the current thread is scheduled again, it would resume its preempted scheduling activity, although the thread itself should continue! Additionally, the interrupt handler and nested scheduler consume space on the thread's kernel stack, which, being included in the TCB, is quite limited.

Solving all these problems, L4 uses a dedicated thread for the scheduling decisions. This thread is never resumed after preemption but rather started anew, discarding its previous state (the scheduling-relevant data structures have possibly changed since the scheduler has been preempted, so a fresh start is recommended). This property also removes the requirement for an exceptionally large kernel stack; except for a single interrupt handler, nothing will be nested on top of the scheduler.

The consequences of this model are that dispatching via the scheduler requires two thread switches: first from the current thread to the dispatcher thread, and then from the dispatcher to the next thread. However, as the dispatcher never executes in user mode, it does not have an associated address space, so switching to it is cheap (no address space switch, no TLB flush). As it is never referenced from user mode, it can even do without a thread ID, which further allows its stack to be placed (and sized) freely; it need not be included in any TCB.

Lazy Dispatching

L4 uses circular ready lists per priority level. If these lists were maintained eagerly, a thread would have to be removed whenever it blocks (e.g., during a call IPC, waiting for the reply). Removing a thread from the ready list requires touching its preceding list element (to update its next pointer) as well as touching its successor list element (to update its prev pointer) in addition to the list element itself. With the list being embedded in TCBs, this translates to having to touch three TCBs (predecessor, successor, and current), each one probably covered by a different TLB entry. Though we cannot avoid touching the current TCB, and though we will probably switch to the next thread anyways, so touching the successor list element does not "waste" a TLB entry for the single purpose of updating the ready list, touching the predecessor TCB solely serves the purpose of keeping the ready list up-to-date. The possible TLB miss on touching the predecessor TCB can be avoided by maintaining the ready lists lazily as follows.

Instead of removing a blocking thread's TCB from the ready list immediately, we just leave it in there and simply set its state to blocked
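A small sketch of that lazy approach: blocking merely flips the thread's state, and the dispatcher unlinks stale entries only when it actually walks over them. Illustrative code, not L4 source; it assumes at least one READY thread (e.g. the idle thread) is always on the list.

    /* Lazy ready-list maintenance. */
    typedef enum { READY, WAITING } state_t;

    typedef struct tcb {
        struct tcb *prev, *next;   /* doubly linked circular ready list */
        state_t     state;
    } tcb_t;

    /* Blocking path (e.g. during call IPC): just flip the state.  We avoid
     * touching the predecessor TCB, which might cost an extra TLB miss. */
    static void block_current(tcb_t *current)
    {
        current->state = WAITING;
    }

    /* Dispatcher: skip and unlink threads that turn out to be blocked.
     * The caller uses the returned TCB as the new list head. */
    static tcb_t *next_ready(tcb_t *head)
    {
        tcb_t *t = head;
        while (t->state != READY) {
            tcb_t *stale = t;
            t = t->next;
            stale->prev->next = stale->next;   /* remove lazily, now that we are  */
            stale->next->prev = stale->prev;   /* touching these TCBs anyway      */
        }
        return t;
    }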

...

Switching Threads

Switching threads is conceptually easy:

    Store the CPU state of the currently running thread in its TCB
    Load the CPU state of the thread to resume from its TCB into the CPU 

However, two problems arise: First, TCBs must be protected kernel objects, so that no thread on user-level can manipulate other threads' states. Furthermore, threads must not be allowed to modify even their own TCB, if/as the TCBs contain scheduling information (e.g., the remaining timeslice length).

As a consequence, thread switches must (most of the time, see Chapter 10, Local IPC) be performed by the kernel, which means that the kernel must run in order to do so. If the kernel is running, the user-level state of the thread must be preserved somewhere in order to be able to correctly resume it. For this second problem, some hardware support on entering the kernel is required: On kernel entry, the user-level CPU state could be saved on a stack (x86), or the kernel can use a different set of CPU state (register banks, found in MIPS or Itanium machines).

On stack-based machines (x86), a thread switch thus requires the following steps:

    A thread executes in user mode
    Some event triggers a thread switch (interrupt, exception, syscall, ...)
    The system enters kernel mode
        The hardware automatically stores the user-level stack pointer ...
        ... loads a kernel stack pointer
        and saves the user-level instruction pointer and stack pointer on the kernel stack
        The hardware loads a kernel instruction pointer based on the cause of the kernel entry (interrupt, exception, syscall, ...) 
    The kernel code saves the remaining CPU state into the current thread's TCB before overwriting it
    The kernel loads the CPU state of the next thread from its TCB ...
    ... and returns to user-mode, causing the hardware to
        load the user-level instruction pointer and
        the user-level stack pointer from the current stack 

Kernel Stack Models

Concerning the stack being used in kernel mode, different models are possible. The kernel stack should always be different from the user-level stack for security reasons: First, the user can set up an invalid stack pointer before entering kernel mode; loading a well-known, valid stack pointer on kernel entry is likely to be easier than validating the current value. Secondly, the kernel entry might be caused by a page fault on the user-level stack. Storing user-level state on this stack in order to handle the page fault would thus raise further page faults (accessing the same unmapped memory), which can never be handled.

The presented models differ in the number of kernel-level stacks.

Per-Processor Kernel Stack

All threads executing on one CPU can use the same kernel stack. [2] As a consequence, either only one thread can execute in kernel mode at any time (i.e., threads executing in kernel mode cannot be preempted), or unusual approaches such as continuations must be used.

This model is efficient with respect to cache lines used during a context switch (only one stack), but uncomfortable from the kernel developer's point of view.

Per-Thread Kernel Stack

Alternatively, each thread can have its own kernel-level stack. This model resembles user-level code and makes kernel mode nothing special from the kernel developer's point of view. However, this model is more costly both with respect to virtual memory use (only one stack must be addressable in the previous model, whereas this model requires (number of threads) stacks to be present in virtual memory) and with respect to cache line use on thread switch: Both the stack of the current thread and the stack of the next thread are accessed, effectively doubling the number of cache lines used for them.

Due to its simplicity, L4 employs the per-thread stack model and integrates the kernel stacks into the TCBs. (TCB plus stack fit into a single page to avoid additional TLB misses.)
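A sketch of what a thread switch looks like under the per-thread kernel stack model: each TCB embeds a small kernel stack, and switching threads inside the kernel is essentially saving one kernel stack pointer and loading another. arch_stack_switch() stands in for a few lines of architecture-specific assembly and is an assumption, not a real function.

    #include <stdint.h>

    #define KSTACK_BYTES 2048   /* TCB plus stack fit in one page to save TLB entries */

    typedef struct tcb {
        uintptr_t ksp;                    /* saved kernel stack pointer            */
        uint8_t   kstack[KSTACK_BYTES];   /* per-thread kernel stack               */
        /* the thread's user-visible state is saved on kstack on kernel entry      */
    } tcb_t;

    /* Saves the current callee-saved registers and stack pointer into *save_ksp,
     * then loads load_ksp and returns on the new kernel stack, so execution
     * continues wherever the next thread last called switch_threads(). */
    extern void arch_stack_switch(uintptr_t *save_ksp, uintptr_t load_ksp);

    void switch_threads(tcb_t *cur, tcb_t *next)
    {
        arch_stack_switch(&cur->ksp, next->ksp);
    }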

...

Fast System Calls on x86

On the other hand, using int is quite slow, especially due to the memory lookup of both the kernel stack pointer (%esp0) and the entry point of the handler routine in the interrupt descriptor table (IDT). As kernel entries/syscalls are frequent (especially in microkernel-based systems), Intel introduced a less costly alternative kernel entry facility via the sysenter/sysexit instructions. [4]

These new instructions avoid memory lookups by providing CPU-internal special purpose registers (model-specific registers, MSRs) to store a single entry point into the kernel as well as a single kernel stack pointer to load on kernel entry via sysenter. Furthermore these instructions effectively disable segmentation by setting up flat segments (full 4 GB segments starting at linear/virtual address 0), which allows to skip additional costly segmentation-related checks on mode changes between kernel-mode and user-mode. As most of today's operating systems (including L4) ignore segmentation and use flat segments already (using only paging for access control and physical memory management), this is usually not a problem (but see Chapter 9, Small Spaces).

As sysenter always jumps to the same handler routine in the kernel, an argument (a register) would be required to select one of the syscalls offered by the kernel. Instead of sacrificing a register for this purpose, L4 implements only the most frequently used system call, namely IPC, via sysenter, all other system calls use the traditional kernel entry method via int and either have an explicit argument denoting the system call number or use different interrupt vectors (int 0x80, int 0x81, ...) for the different system calls.

As L4 aims not only at maximum speed but also at portability, systems without the (relatively) new sysenter instruction shall also be supported and the optimal method of entering the kernel shall be selected at runtime. A problem here is that the stack layout after kernel entry via sysenter differs from the layout after using int: Using sysenter, the hardware only

    loads %eip from a MSR and forces %cs to the flat segment (0..4 GB),
    loads %esp from a MSR and forces %ss to the flat segment (0..4 GB),
    turns off interrupts,
    and enters kernel mode. 

Nothing is stored on the stack; the user-level instruction pointer and stack pointer to use when returning to user level must be passed in registers (by convention and because sysexit expects them there, %eip is passed in %edx and %esp in %ecx).

...

IPC and Scheduling

[Note]

The following elaboration mainly focuses on single-processor systems, i.e., only one thread executes on the CPU at any time.

Furthermore, we assume a strict priority-based scheduling strategy with fixed priorities and round-robin per priority level as provided by L4 (see Chapter 7, Dispatching), though the ideas presented should apply more generally as well.

The implementation of synchronous IPC primitives (send, receive, and combined operations such as call or reply-and-wait) are tightly related to scheduling: Whenever a thread waits for an IPC partner (trying to send to a thread that is currently not accepting messages, or trying to receive a message while none is pending), the thread is blocked and another thread needs to be scheduled.

Conversely, if the send phase of a call operation completes, the sender will wait for the response of the receiver. This (common!) case can be improved by avoiding going through the scheduler and instead dispatching the receiver thread directly, allowing it to use the remainder of the sender's timeslice to start fulfilling its request. Executing the receiver in the sender's timeslice is called ((timeslice donation)); the sender would block anyway (after the call), but instead of losing its assigned CPU time, the sender passes it on to the receiver so that the latter can do some work on the sender's behalf.

This approach is fair in all respects if the sender used the call primitive:

    The sender cannot continue execution, but wants the receiver to do some work; donating processor time can only speed things up for the sender.
    The receiver gets extra CPU time; the remainder of the sender's timeslice is accounted for by the sender (not by the receiver) and does not affect later scheduling decisions for (or against) the receiver thread.
    The receiver effectively inherits the priority of the sender for the donated timeslice, so that the receiver may run even if it has a low priority. This never leads to threads with a high(er) priority being delayed longer than till the end of the sender's timeslice, which would also have to be tolerated if the sender had not sent a message to the receiver, and thus does not violate scheduling guarantees for threads other than the sender (who donated some processor time to the receiver, thus is slightly penalized). 

Besides a call to a waiting receiver, other (attempted) IPC operations also require dispatching/scheduling actions:

send, receive, or call while the partner is not waiting
    This is the most obvious and easiest case, as the only "choice" is to block the current thread (set it to waiting) and schedule another one (either according to the scheduling policy or (if possible) the missing partner thread so as to allow the IPC to complete as soon as possible).
receive from a partner that has previously called us
    In this case, the transfer can complete immediately and the receiver continues (the partner thread is now waiting for the receiver's reply).
call a partner that is waiting for us
    As described before, the receiver is dispatched and allowed to use the remainder of the sender's timeslice (timeslice donation).
receive from a partner that has previously invoked a send for us
send to a partner that has previously invoked a recv or called us

    These cases are different from the previous ones: Before, either the sender or the receiver (or none of them) was ready to run after the invocation of the IPC primitive; the other thread blocked. If (as in these cases) both threads are ready to run after the IPC, a true scheduling decision must be made:
        Does the awakened/unblocked thread have a higher priority than the currently running thread? If so, we must switch threads (the unblocked thread will be the only ready-to-run thread in the highest ready priority level, otherwise the currently running thread could not have been running (priority too low!)).
        If the unblocked thread has a lower priority than the current thread, continue executing the current thread.
        If the two threads have the same priority, a policy decision is required: Continue to execute the current thread until the end of its timeslice (lazy IPC operation) or immediately switch to the partner (eager)?
        [The author of this document believes that the eager approach unjustifiably penalizes the current thread (cutting short its timeslice though it has work to do) and thus favors the lazy approach, which additionally avoids the overhead of a thread switch. L4 implements the lazy approach.]" -- http://web.archive.org/web/20140803112320/http://i30www.ira.uka.de/~neider/edu/mkc/mkc.html#id2529997

---

---

plan 9's syscalls:

Plan 9 System Calls

" All of 'em (excluding obsolete ones).

Files
    open - open an existing file
    create - create a new file or open an existing file for writing
    pread - read from an open file
    pwrite - write to an open file
    chdir - change current directory
    seek - change the current position in an open file
    close - close a file descriptor
    dup - duplicate a file descriptor
    fd2path - retrieve file name
    stat - read file metadata
    fstat - read open file metadata
    wstat - write file metadata
    fwstat - write open file metadata
    remove - delete a file

Process management
    rfork - change process attributes or create a new process
    exec - replace the current process
    exits - terminate the current process
    errstr - exchange error string
    sleep - sleep a given amount of time

Synchronization
    await - wait for a child process to terminate
    pipe - create a pipe
    rendezvous - exchange a word of data
    semacquire - acquire a semaphore
    semrelease - release a semaphore

Memory management
    brk_ - allocate memory
    segattach - attach to or create a segment
    segdetach - detach from a segment
    segfree - free physical memory
    segbrk - change segment length
    segflush - flush cache

Namespace management
    mount - mount a 9P connection
    bind - bind a file or directory
    unmount - unmount or remove a bind

9P connections
    fversion - initialize a 9P connection
    fauth - initiate authentication

Notes
    alarm - set the alarm timer
    notify - set the note handler
    noted - continue after note

" -- https://aiju.de/plan_9/plan9-syscalls

wow that does look simpler!
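To get a feel for how few primitives are involved, here is a small Plan 9 C sketch that runs a command with rfork and exec and then collects it with wait (the libc wrapper around the await syscall, as I understand it); it is the moral equivalent of fork/exec/waitpid on Unix.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        Waitmsg *w;
        char *argv[] = {"date", nil};

        switch(rfork(RFFDG|RFPROC)){         /* new process, copy of the fd table */
        case -1:
            sysfatal("rfork: %r");
        case 0:
            exec("/bin/date", argv);
            sysfatal("exec: %r");            /* exec only returns on error */
        default:
            w = wait();                      /* wraps the await syscall */
            if(w != nil)
                print("child exited: %s\n", w->msg);
        }
        exits(nil);
    }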

---

https://templeos.holyc.xyz/Wb/Doc/Welcome.html

http://www.codersnotes.com/notes/a-constructive-look-at-templeos/

---

http://www.r-5.org/files/books/computers/internals/unix/Francisco_Ballesteros-Notes_on_the_Plan_9_Kernel_Source-EN.pdf

---

" Almost any RTOS on the Cortex-M profile will fully utilize the NVIC, giving you a hardware-managed preemptive FIFO scheduler even in interrupt context. NuttX?'s lack of this feature is a severe weakness, and a major reason to avoid using it.

... (someone else's reply:)

I think this is a misconception about how RTOSes can work. Certainly in a bare-metal environment (or with a minimal RTOS), prioritized interrupt-level processing is critical. But with RTOSes like NuttX?, interrupts should be very brief: You should enter the handler, perform minimal housekeeping, and signal a prioritized thread to perform the interrupt-related operations.

This is not an oversight in the design; it is an intentional replacement of prioritized interrupt handling with prioritized multi-tasking processing.

The NuttX? interrupt handling is designed so that the context switch after signaling the prioritized thread is ZERO. The return from interrupt vectors directly into the prioritized task (assuming it is the highest priority). This is all intentional design and is very efficient in these regards.

"

" There is no one-size, fits all RTOS. You find that some are too minimal and primitive to meet your needs (I am thinking FreeRTOS?, ChibiOS?, Zephyr, ...) and some may be heavier weight than is necessary for your purposes (perhaps NuttX?, nucleus, RTEMS, ...) or even heavier (VxWorks?, QNX, Integrity, and on up to Linux). "

---

the most popular foss embedded os's AFAICT:

so my revised list is:

in other words, out of the popular ones, we have:

minimum ROM sizes (i say 'k' but i mean KiB?):

so, the restriction of the above to things with <10k are:

Note that for some of my purposes this is somewhat deceptive, as a lot of the bulk in the others is probably device drivers that are irrelevant to me. Still, if i want to learn about minimalistic systems by example, it's easier to look at a small one rather than starting with a bigger system and decide for myself what is crucial.

google searching for freertos zephyr riot rt-thread uc/os-iii chibios and then removing them one by one:

so the ones to focus on for me are probably:

some docs:

https://www.freertos.org/Documentation/RTOS_book.html https://www.freertos.org/features.html https://www.freertos.org/simple-freertos-demos.html

https://docs.zephyrproject.org/latest/reference/kernel/index.html#kernel-api https://docs.zephyrproject.org/latest/reference/overview.html https://docs.zephyrproject.org/latest/samples/index.html https://docs.zephyrproject.org/latest/guides/index.html

https://github.com/RIOT-OS/RIOT/wiki http://riot-os.org/docs/riot-ieeeiotjournal-2018.pdf https://github.com/RIOT-OS/Tutorials/blob/master/README.md https://doc.riot-os.org/

RT-Thread Nano doesn't have its own good docs (at least in English; it looks to me like the Chinese docs are also perfunctory), but a comparison of the diagrams in https://medium.com/@rt_thread/rt-thread-iot-os-nano-version-introduction-ea104773d4bf https://github.com/RT-Thread/rt-thread

shows that RT-Thread Nano appears to be RT-Thread restricted to the following components:

however according to https://github.com/RT-Thread/rtthread-manual-doc/blob/master/basic/basic.md even the normal RT-thread kernel only needs 3KB ROM so RT-thread Nano isn't that different from just RT-thread's kernel packaged as a separate distribution.

here are some docs for some of that (not all of it; i didn't include clock, interrupt, memory yet):

https://github.com/RT-Thread/rtthread-manual-doc/blob/master/basic/basic.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread/thread.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread-comm/thread-comm.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread-sync/thread-sync.md https://github.com/RT-Thread/rtthread-manual-doc https://github.com/RT-Thread/rtthread-nano/blob/master/rt-thread/include/rtthread.h https://github.com/RT-Thread/rtthread-nano/blob/master/rt-thread/include/rtdef.h https://github.com/RT-Thread/rtthread-nano/blob/master/rt-thread/include/rtlibc.h https://github.com/RT-Thread/rtthread-nano/blob/master/rt-thread/include/rtservice.h https://github.com/RT-Thread/rtthread-nano/tree/master/rt-thread/src https://github.com/RT-Thread/rtthread-nano https://www.cnx-software.com/2020/02/25/getting-started-with-rt-thread-iot-os-nano-rtos-on-risc-v-processors/ https://www.rt-thread.org/document/site/tutorial/nano/an0038-nano-introduction/ https://www.embedded-computing.com/news-releases/an-introduction-to-a-chinese-rt-thread-iot-os
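For a flavor of the API surface those documents describe, a minimal RT-Thread thread looks roughly like this, based on my reading of the manual (the stack size, priority and timeslice values are arbitrary; verify the exact rt_thread_create signature against the manual linked above):

    #include <rtthread.h>

    /* Demo thread: logs the current tick count once a second. */
    static void demo_entry(void *param)
    {
        for (;;) {
            rt_kprintf("tick %d\n", (int)rt_tick_get());
            rt_thread_delay(RT_TICK_PER_SECOND);   /* sleep one second's worth of ticks */
        }
    }

    int main(void)
    {
        /* name, entry, entry arg, stack size, priority (lower = higher), timeslice ticks */
        rt_thread_t tid = rt_thread_create("demo", demo_entry, RT_NULL, 1024, 10, 20);
        if (tid != RT_NULL)
            rt_thread_startup(tid);
        return 0;
    }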

---

other random small OSs:

http://www.femtoos.org/ " The Femto OS is a very concise portable real time - preemptive operating system (RTOS) for embedded microcontrollers with minimal ram and flash, say 2KB .. 16KB flash and 128 .. 1024 bytes ram. The main target is the Atmel AVR architecture, such as the ATtiny or smaller ATmega series. The OS runs well on larger hardware also. The system is written in C with a separate port file. Porting has been done for 44 AVR devices.

The typical footprint is between 1K and 4K flash, depending on the functions used. Somewhere around 2K for the OS itself is a realistic figure for normal applications. The OS takes between 10 and 20 bytes of ram, tasks can take as little as 6 bytes of ram, but approximately 20 to 40 bytes is more realistic for real applications. Figures include the stack. It is perfectly possible to run 4 or more tasks on an ATtiny261. There is no separate idle task and most api function calls run in OS space. The OS is interruptible most of the time, if needed. There are tools to watch and protect the use of the stack of the tasks and the OS. "

---

what is in the intersection of freertos, zephyr, riot, rt-thread nano (should i look at L4 too?)?

i think the thing to do is first make a list of components and then for each component make a list of actual functions (which might not map 1-1).

above i said:"RT-Thread Nano appears to be RT-Thread restricted to the following components:

this sounds like the sort of stuff that some other projects put all in a 'kernel' (and as noted above, RT-Thread Nano may be no different from just RT-Thread's kernel). Here's zephyr's 'kernel':

https://docs.zephyrproject.org/latest/reference/kernel/index.html#kernel-api

riot's kernel: https://doc.riot-os.org/group__core.html

https://www.freertos.org/wp-content/uploads/2018/07/161204_Mastering_the_FreeRTOS_Real_Time_Kernel-A_Hands-On_Tutorial_Guide.pdf https://www.freertos.org/wp-content/uploads/2018/07/FreeRTOS_Reference_Manual_V10.0.0.pdf

https://github.com/RT-Thread/rtthread-manual-doc/blob/master/basic/basic.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread/thread.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread-comm/thread-comm.md https://github.com/RT-Thread/rtthread-manual-doc/blob/master/thread-sync/thread-sync.md https://github.com/RT-Thread/rtthread-manual-doc

i haven't done this analysis yet, but those are probably the components and links to start with.

---

notes on https://github.com/RT-Thread/rtthread-manual-doc/blob/master/basic/basic.md (in progress):

clock management:

thread sync:

ipc:

TODO: read/take notes on the rest, starting at https://github.com/RT-Thread/rtthread-manual-doc/blob/master/basic/basic.md#memory-management

---

notes on https://weinholt.se/articles/non-posix-filesystems/

some features that Multics had:

Why is only the latest incremental backup required to get everything back in its place and working? The clever part is that the files might not yet be on the disk. In fact, most files will probably be on another backup medium. But the most recently used files have been restored, so you can most likely do useful work already, and all other files have their directory entries. If you try to open a file that hasn’t been restored then you’re asked to wait and the operator is told which tape to put in to get the file back. "

in the discussion on HN a user comments on the last feature:

"Nothing about this is “easily” once you work through the edge cases for performance and reliability...Your system can appear to work well with one test workload and then fail miserably when two people start running different tasks at the same time...I generally think this class of software is a mistake. Any time you misrepresent one class of storage as another it inevitably leads to very complex software which is still pretty fragile and confuses its users on a regular basis, and the cost savings never deliver to the hoped-for degree" [51]

a feature that Xerox Alto had:

    A file identifier
    A version number
    A page number (block index)
    A length
    A next link (pointer to the next block in the file)
    A previous link

The label means that each block is self-identifying. By my count this type of disk had an additional 14 bytes of label per each 512 byte block.

If the disk structure is damaged then a special program called the scavenger will iterate over all blocks and recover the structure. Contrast this to a file system like ext4 where a damaged structure can mean that huge swathes of data are rendered inaccessible. ... Why don’t we have this today? Nobody wants to build a POSIX file system that uses labels because the disks we’re using have 512-byte or 4096-byte blocks and no room for a label. A label would have a very small overhead for each block, but even a very small overhead like this is not acceptable to file system designers. More seriously, it would mess up mmap(2). ... The best modern alternative is something like Btrfs or ZFS. Blocks are checksummed and those checksums are stored next to the pointers in the disk structure. A modern version of the Alto file system should certainly have checksums, but not using labels gives weaker protections than what the Alto file system provided"
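The label fields listed above translate naturally into a per-block header. The struct below is my own reconstruction of the idea, with field sizes chosen to land near the "14 bytes of label per 512-byte block" figure quoted above; it is not the Alto's actual on-disk layout.

    #include <stdint.h>

    /* Self-identifying disk block in the spirit of the Alto file system:
     * every block carries enough metadata that a "scavenger" can rebuild
     * the file system structure by scanning all labels. */

    #define DATA_BYTES 512

    struct block_label {            /* roughly 14 bytes of label per block */
        uint32_t file_id;           /* which file this block belongs to    */
        uint16_t version;           /* file version number                 */
        uint16_t page;              /* block index within the file         */
        uint16_t length;            /* bytes used in this block            */
        uint16_t next;              /* disk address of the next block      */
        uint16_t prev;              /* disk address of the previous block  */
    };

    struct disk_block {
        struct block_label label;
        uint8_t data[DATA_BYTES];
    };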

some features that Hydra had:

in the discussion on HN there is also a thread about Plan9:

https://news.ycombinator.com/item?id=24413383

quotes from that thread: "Basically, everything is a file. But to a ridiculous extent, far beyond what you'd normally think of as files...For example, you can connect computers together via networks using the equivalent of `cat`. (And yes, we have netcat, but it's not quite the same thing as having the abstraction built into the OS.) ... If you implement, for example, the network stack as a filesystem, and you have generic facilities which let you bind "files" from one system to another, you can forward your traffic through another host as easily as:

    import remote-host /net

Now your new network connections will go through remote-host, because a network connection is just reads and writes to files in /net. ...

> Now your new network connections will go through remote-host, because a network connection is just reads and writes to files in /net.

Wouldn't this then also affect the connection to this remote host used for importing, effectively looping itself?

...

Nope, because the connection to that remote host was established when we were using the local host's /net. This is one of the other key features of Plan 9: per-process namespaces (inherited by children). Only programs inheriting the current process's namespace will use the imported network stack: you open a shell, you run "import remote /net" to bring in the remote network stack, then you run your network applications within that shell. Applications in any other shell will use the regular local network stack. After importing /net from the remote host, you could also run `rio` in that shell, which will start a new instance of the GUI nested within that window, in which all applications will use the imported stack.

...

, everything was a file, down to the graphical UI. This allowed you to interact with the GUI trough the file system, script things, etc. ... I wish Linux used the filesystem that extensively, to replace things such as d-bus, pulseaudio, gobject, gstreamer, wayland, etc. GNU Hurd has a similar, very interesting mechanism, translators [3] (basically first-class FUSE objects).

...

Here's how you get a graphical session on a remote machine: export your local /dev/draw, /dev/mouse, etc. via 9P to the remote machine, then run the GUI on that remote machine. It'll write to /dev/draw, read from /dev/mouse, and so on with no idea that all those file operations are actually traversing the network.

How did Plan 9 allow its APIs to evolve? For example, reading /dev/mouse returns records that are exactly 49 bytes, in text format [1]; how did they add new fields without breaking every app? ... They’d probably have /dev/mouse and /dev/mouse2 simultaneously. ... namespace
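side note on the per-process namespace point above: in C on Plan 9 this is just rfork(2) plus bind(2). a minimal sketch from memory (the /usr/glenda/bin path is only an example; check the man pages before trusting the details):

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char *argv[] = { "rc", nil };

        /* give this process a private copy of the current namespace;
           binds/mounts made after this are invisible to the parent shell */
        if(rfork(RFNAMEG) < 0)
            sysfatal("rfork: %r");

        /* union a personal bin directory in front of /bin, only for this
           process and whatever it spawns */
        if(bind("/usr/glenda/bin", "/bin", MBEFORE) < 0)
            sysfatal("bind: %r");

        exec("/bin/rc", argv);
        sysfatal("exec: %r");   /* exec only returns on error */
    }

an `import remote /net` done inside such a private namespace is what makes the "only programs inheriting this namespace use the imported network stack" behavior in the quote work.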

the HN discussion mentions a feature that WAFL and NILFS had:

"One of the cool things about WAFL (the Write Anywhere File Layout) system that NetApp? used (uses?) was that it's very design made snapshots 'trivial' since every write to disk was to un-allocated blocks. What that meant in practice was that the file system on disk was always sane. This was what let you pull the power from the box at any time, and assuming its non-volatile RAM was still available, it could always recover. Something that you could do with Intel Cross Point memory and a bunch of disks."

the HN discussion mentions a feature that MTS had:

"...the data model was not stream-of-bytes but rather line-numbered-record file...The whole point of the Unix FS was that the developers felt the mainframe approach (~"files are actually sorta like databases") ((w))as to complex and heavyweight."

and i would also add: what about more database-like filesystems like Microsoft was supposedly working on all those years ago and gave up on?

here's one: https://en.wikipedia.org/wiki/Soup_(Apple)

---

Animats 6 hours ago [–] on: Non-Posix File Systems

I've wanted somewhat different file system semantics, but close to the POSIX model:

Someone 5 hours ago [–]

”Log files. Append-only mode, enforced. Can't seek back and overwrite. On a crash, the file recovers to some recent good write, with a correct end of file position. It does not tail off into junk.“

As wahern says, append-only is available in some OSes. I am not aware of any that guarantees the “recovers to some recent good write” part, though. UNIX-like OSes won’t support it, as they don’t have a way to write more than one byte atomically (https://en.wikipedia.org/wiki/Write_(system_call): The write function returns the number of bytes successfully written into the array, which may at times be less than the specified nbytes)

reply
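side note on the short-write point: this is why robust code wraps write(2) in a loop instead of assuming one call moves the whole buffer. a plain-POSIX sketch (it loops until done, but it is of course still not atomic with respect to other writers, which is exactly the commenter's point):

    #include <errno.h>
    #include <unistd.h>

    /* keep calling write(2) until the whole buffer is out (or a real error);
       a single call is allowed to transfer fewer than count bytes. */
    ssize_t write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;
        size_t left = count;
        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;          /* interrupted before writing, retry */
                return -1;
            }
            p += n;
            left -= n;
        }
        return (ssize_t)count;
    }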

LgWoodenBadger? 5 hours ago [–]

One feature that I’ve wished for many times is a transactional filesystem, for metadata and the actual contents itself.

reply

inopinatus 5 hours ago [–]

There have been attempts at "phase tree" filesystems with essentially transactional write semantics for data and metadata simultaneously. Tux2 was one such nascent attempt that ended (rightly or wrongly) due to patent conflict with Netapp's WAFL [1].

The interesting thing is, I just took a look at what I believe were the patents in conflict, and they appear to have expired in 2015 [2,3]. Combining the phase tree concept with, say, Merkle trees might open up very interesting possibilities for large-scale reliable file storage.

skissane 4 hours ago [–]

Windows has transactional NTFS. It was never widely adopted, in part because it requires use of a whole different filesystem API (CreateFileTransacted? instead of CreateFile?, etc). I think, if it had been more seamless, it may have been more successful. Given it is rarely used, Microsoft has marked it as deprecated for potential removal in future Windows versions.

reply

asveikau 1 hour ago [–]

Iirc it was implemented with a pretty seamless API at the undocumented ntdll layer, which let you set a current transaction on a handle then do API calls as usual.

The issue is that while NT mostly reasons about files as handles, Win32 deals primarily with the name for many operations. So they duplicated every api that took a filename.

reply

nitrogen 6 hours ago [–]

File replacement can be done now through non-portable renaming gyrations.

I was under the impression that POSIX requires renames to be atomic, and have relied on that in embedded Linux systems with UBIFS. Are there exceptions among truly compliant filesystems and OSes, and/or is the atomicity guarantee not as strong as I thought?

reply

wahern 5 hours ago [–]

It's atomic but it might not be persistent in the event of a crash. If the metadata wasn't committed then upon remounting the old file might reappear, and its likelihood of reappearing is independent of any other operations on that filesystem. Though, traditionally rename was not only atomic but also preferentially ordered ahead of other metadata operations so subsequent renames of other files wouldn't be visible if the older rename appeared.

EDIT: Actually, I think the issue I had in mind was that writes to the new file might not be committed before the rename, so if you do open + write + rename + close, on a crash the rename might persist even though the write didn't. You technically should use fdatasync if you want the write + rename to be ordered. See, e.g., https://lwn.net/Articles/322823/ But this is such unintuitive behavior that I think that even for ext4 the more reliable behavior is still the default.

reply
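the open + write + fdatasync + rename idiom wahern describes, as a sketch (error handling kept minimal; the fsync of the containing directory at the end is the step that makes the rename itself durable, and it's the part most code forgets):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* replace "path" atomically: write a temp file in the same directory,
       flush its data, rename it over the old name, then fsync the directory
       so the rename itself survives a crash. */
    int replace_file(const char *dir, const char *path, const char *tmppath,
                     const void *data, size_t len)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        /* a real version would loop on short writes */
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fdatasync(fd) < 0) { close(fd); return -1; }  /* data before rename */
        close(fd);

        if (rename(tmppath, path) < 0) return -1;         /* the atomic step */

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        int r = fsync(dfd);                               /* make the rename durable */
        close(dfd);
        return r;
    }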

jacobsenscott 5 hours ago [–]

Isn't this true of any atomic operation in the event of a crash? You'll either see the old data or the new data when the system comes back up.

reply

ori_b 30 minutes ago [–]

The alternative would not allow for a posix-compliant ramfs implementation.

reply

---

http://www.barrelfish.org/documentation.html

---

http://man.9front.org/1/rc

https://plan9.io/sys/doc/acidpaper.html

---

"Newton soups: no need for a filesystem when you have a system-wide graph database."

---

"Expand on the idea behind Erlang and processes having a common messaging behavior. Almost everythin becomes “gen_server” like. I know Plan 9 has this idea of everything as a file server, but abstract that a little bit and create the same interface across the board."

c-cube 4 days ago

If there’s a thing to take from Erlang, I think it’d be to use structured data (tuples, atoms, ints, strings/bitstrings) in messages, not just a stream of bytes like unix does. I think an OS where programs communicate via Erlang’s data structures would be much better and richer than anything built on unstructured bytes.
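roughly what "structured data in messages, not just a stream of bytes" could mean at the interface: the IPC primitive moves tagged values the kernel/runtime understands, so receivers can match on shape instead of re-parsing a byte stream. a made-up sketch (the tags, the struct and chan_send are all hypothetical, nothing from Erlang's actual term format):

    #include <stdint.h>
    #include <stddef.h>

    /* sketch: messages are trees of tagged values rather than opaque bytes */
    enum tag { T_INT, T_ATOM, T_BINARY, T_TUPLE };

    struct value {
        enum tag tag;
        union {
            int64_t i;                                        /* T_INT */
            const char *atom;                                 /* T_ATOM, interned name */
            struct { const uint8_t *p; size_t n; } bin;       /* T_BINARY */
            struct { struct value *elems; size_t n; } tup;    /* T_TUPLE */
        } u;
    };

    /* hypothetical send: the OS/runtime preserves the structure end to end,
       so the receiver gets a struct value, not a blob it has to re-parse */
    int chan_send(int chan, const struct value *msg);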

---

" Functional programming methodology in tooling. What I mean by this pretty terrible phrase is something like the command line in pipes but as a language across the system. Complex data flows (streams, bytes, strings, objects, etc) enter tools and I can process and filter them and produce something else. Almost like IFTT and pf at a system level. This builds on the “everything is a gen_server” idea above. Elevate the terminal and “user facing” commands more. Think Alfred, Quicksilver, but it is the terminal and user friendly so my mother-in-law used it. “open” is a tool that acts on data, that data might be an application, an image, a URL, etc. Again, think functional. :) Finally make it so that things like dtrace/ebpf are just standard behavior, not these complex hooks into a system, but a part of the standard method for writing process, that they naturally produce that info that these types of tools can dig into. Oh one more thing, packages/package manager should isolate all installable items (think Nix). And don’t choose some special “bucket”, use standards like tar/gzip. I don’t want to have to have special tools to peer into some package. "

---

sujayakar 4 days ago

I would love to see an OS designed for making testing and debugging user-space as effortless as possible.

First, it’d be incredible to have something like rr embedded within the OS. Processes could be run fully deterministically, with OS nondeterminism like scheduling decisions being scriptable or pseudorandomized by the test driver. We’d have no more flaky tests, and the OS would actively cooperate with programmers for finding concurrency bugs where the program relies on a particular schedule.

Then, with user-space fully determinized, we can rely more on property-testing (a la QuickCheck?) for easily improving program correctness. Current fuzzers in this space do sophisticated binary analysis and use different hardware counters for exploring different program paths, and the OS is in the unique position of being able to abstract over these. I’m imagining something like SAGE being offered as an operating system service. Then, your program is deterministic by default, and you can ask the operating system for test data that it intelligently synthesizes with binary path exploration techniques!

Finally, with all of these features, the OS could also provide an amazing debugger that allows developers to time travel through program executions and visualize their results. The Pernosco folks have done some really cool work exploring what debugging can look like with deterministic replay.

https://lobste.rs/s/jr4crd/what_should_new_os_have#c_ssqrkc (there were hyperlinks that i didn't copy over, go look at the original comment for those)

---

" Kill the VT100. Command lines shouldn’t be limited by curses, but display rich objects too. This complements what people are talking about with structured IPC. Emacs shows ideas of how this can work, albeit in a crude manner. If your system has consistent commands (or some kind of reflection capabilities on them), I should have IDE-like prompting of arguments and values. Tab complete is insufficient. IBM i has F4 to prompt for values anywhere, imagine how consistent that could be if it were like Visual Studio. Some kind of way for applications to communicate better. Scripting-wise, AppleScript? is probably the most popular example that works globally, though VBA is more popular for application-local. AppleScript? had a bad language, but the model was incredibly powerful. If end-users want to program their computer, this is how they do it. Enrich drag-and-drop. RISC OS puts such emphasis on it that it’s basically interactive pipelines. Rich embedability in documents. Remove distinctions between different types of documents or even folders. See BTRON, OLE on Windows, ViewPoint?. "

---

Van Jacobson's network channels

https://lwn.net/Articles/169961/

the basic idea is (roughly): reduce locking, copying, and context switches by moving packets through lock-free per-endpoint queues ("channels") and doing protocol processing at the consuming end instead of in shared, locked kernel paths

reminds me a little of the uring stuff b/c of the lockless queue communication between kernel and user program
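the lockless-queue ingredient in the abstract: a single-producer/single-consumer ring where each side only ever writes its own index, so no lock is needed between the kernel-side producer and the user-side consumer. just a generic C11 sketch of the idea, not Van Jacobson's actual code:

    #include <stdatomic.h>
    #include <stddef.h>

    /* SPSC ring: producer advances head, consumer advances tail;
       neither index is ever written by both sides. */
    #define RING_SLOTS 256                  /* power of two */

    struct ring {
        void *slot[RING_SLOTS];
        _Atomic size_t head;                /* written by producer only */
        _Atomic size_t tail;                /* written by consumer only */
    };

    int ring_put(struct ring *r, void *item)
    {
        size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (h - t == RING_SLOTS)
            return -1;                      /* full */
        r->slot[h % RING_SLOTS] = item;
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 0;
    }

    void *ring_get(struct ring *r)
    {
        size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t == h)
            return NULL;                    /* empty */
        void *item = r->slot[t % RING_SLOTS];
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return item;
    }

io_uring's submission/completion rings (linked just below) are the same basic shape, which is probably why it rang a bell.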

---

https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

---

dbremner 4 days ago

Here are some of the features I would like to have in a new OS. I have listed OSes that implemented a feature in bold.

    File locking and record locking are well-behaved.
    It’s not the default but real-time scheduling is available. IRIX
    Networking is zero-copy and uses Van Jacobson’s network channels.
    Instrumentation such as DTrace or ETW is pervasive and it’s easy to add it to applications. Solaris
    Commands use regular lexemes, e.g. every command to create an object starts with crt. i/OS
    Command line options use consistent arguments, e.g. -v is always verbose.
    The shell can complete command line options; applications have metadata that indicates what options are permitted.
    Bounds checking is cheap and everything uses it. SOS
    Overflow checking is cheap and everything uses it.
    There’s a pragma to disable it for hash functions and such.
    File associations use MIME types. BeOS
    Checkpointing and session recovery is provided as a library. NOS
    It’s easy to add support for it to applications.
    You can stop an application on one computer and resume it on a different one.
    This is an evolved version of dropfiles.
    APIs come with static analyzers to identify issues.
    APIs come with testcases, e.g. you can run your application in a mode where memory allocations randomly fail.
    Filenames are case-insensitive, case-preserving, and limited to printable characters.
    When was the last time you had filename.txt and Filename.txt in the same directory?
    Applications can be run in a debugging mode where system calls log the address of the caller, arguments, and the timestamp in a circular buffer.
    APIs are versioned and use semantic versioning.
    Old versions are removed on a regular schedule to reduce technical debt.

-- https://lobste.rs/s/jr4crd/what_should_new_os_have#c_oriefp (there were hyperlinks that i didn't copy over, go look at the original comment for those)
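a guess at what "applications have metadata that indicates what options are permitted" might look like in practice: each command ships (or registers) a small declarative table that the shell reads for completion and prompting. everything in this sketch is made up:

    /* hypothetical: a per-command options table a shell or completion daemon
       could load, instead of scraping --help text. */
    struct optspec {
        const char *flag;          /* spelling of the option */
        const char *arg;           /* NULL = takes no argument */
        const char *help;          /* one-line description for prompting */
    };

    static const struct optspec cp_opts[] = {
        { "-v", NULL,   "verbose (consistent meaning across all commands)" },
        { "-r", NULL,   "copy directories recursively" },
        { "-t", "DIR",  "copy into DIR" },
        { NULL, NULL, NULL }
    };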

---

Little Things That Made Amiga Great

https://datagubbe.se/ltmag/

---

https://genode.org/

---

" Despite running html5zombo.com for over 10 years now, I’m not sure I really appreciated what SVGs were capable of until I built this one. They can embed images? CSS? Javascript? Any site that allows users to upload arbitrary SVGs and renders them now has my utmost respect. "

---

serenityOS

https://awesomekling.github.io/I-quit-my-job-to-focus-on-SerenityOS-full-time/

--- barrelfish

http://www.barrelfish.org/ http://www.barrelfish.org/documentation.html http://www.barrelfish.org/publications/TN-000-Overview.pdf

---

https://unikraft.org/

---

https://mars-research.github.io/redleaf

--- historical

http://www.edm2.com/index.php/The_OS/2_API_Project

https://www.informit.com/articles/article.aspx?p=24972

---

https://kolibrios.org/en/

" KolibriOS? is a tiny yet incredibly powerful and fast operating system. This power requires only a few megabyte disk space and 8MB of RAM to run. Kolibri features a rich set of applications that include word processor, image viewer, graphical editor, web browser and well over 30 exciting games. Full FAT12/16/32 support is implemented, as well as read-only support for NTFS, ISO9660 and Ext2/3/4. Drivers are written for popular sound, network and graphics cards.

Have you ever dreamed of a system that boots in less than few seconds from power-on to working GUI? Applications that start instantly, immediately after clicking an icon, without annoying hourglass pointers? This speed is achieved since the core parts of KolibriOS? (kernel and drivers) are written entirely in FASM assembly language! Try Kolibri and compare it with such heavyweights as Windows and Linux.

KolibriOS? has forked off from MenuetOS? in 2004, and is run under independent development since then. All our code is open-source, with the majority of the code released under GPLv2 license. Your feedback is very much appreciated, and your help is even more welcome. "

---

https://arcan-fe.com/2021/09/20/arcan-as-operating-system-design/

---

https://ngnghm.github.io/

---

https://redox-os.org/

---

kerla

https://github.com/nuta/kerla

---

afr0ck 3 hours ago [–]

Why do people implicitly assume that micro kernels are the best OS design?

In fact, micro kernels have a lot of issues and limit the possibilities of what an OS can do! One of the most important issues is that they prevent resource sharing in a rich and efficient way between the components of the kernel. They make building rich relationships between kernel data structures and kernel subsystems very messy, complicated, impracticable and very inefficient.

Micro kernels are delusional and people should stop implicitly assuming their superiority.

reply

PaulDavisThe?1st 2 hours ago [–]

This is a limitation of the memory model, not the microkernel. If you had a single hardware-protected address space, kernel modules could share resources freely.

But this does argue against message passing, since such systems preferentially try to use messages over sharing whenever it's possible.

reply

sumtechguy 2 hours ago [–]

Message passing at first seems interesting and neat and useful. It is for some cases. But it comes at a cost of complexity and latency.

reply

---

PaulDavisThe?1st 2 hours ago [–]

Mach's internals didn't follow the Unix model, and it was by almost any metric a failure.

There have been several other "real micro kernel based OS based on a message passing core", and they too, by almost any metric, have been a failure.

Process scheduling in Linux in 2021 is far, far ahead of anything in any other OS, ukernel or MP-based. The idea that there hasn't been progress in this area is really completely absurd. In fact, the general idea that Linux follows "the now 50 year old Unix architecture" is also pushing the boundaries quite a lot.

reply

dnautics 2 hours ago [–]

> "real micro kernel based OS based on a message passing core", and they too, by almost any metric, have been a failure.

BeOS? wasn't half bad. The failure was probably mostly commercial. It is true that they moved their networking stack into the kernel, but it's not entirely clear to me that they had to do that, it was just the most expedient way to get acceptable performance and stability given the (programmer-time) resourcing constraints of the project.

reply

PaulDavisThe?1st 1 hour ago [–]

> BeOS? wasn't half bad. The failure was probably mostly commercial.

Fair point. You could probably say this of other attempted microkernels too, to be even fairer. But that doesn't really change the fundamental point that just cooking up "a better OS design" doesn't lead to it being successful. There's a lot more in play than "mere quality".

reply

---

https://github.com/klange/toaruos

---

crazyloglad 22 hours ago

I keep my vote on QEmu(KVM) + Qcow2 as the non-distribution sanctioned packaging and sandboxing runtime. Once ironically, now as faint hope. The thing is, the packaging nuances matters little in the scale of things, heck - Android got away with a bastardised form of .zip and an xml file.

The runtime data interchange interfaces are much more important, and Linux is in such a bad place that even the gargantuan effort Android 5.0 and onwards applied trying to harden theirs to a sufficient standard wouldn’t be enough. No desktop project has the budget for it.

You have Wayland as the most COM-like being used (tiptoeing around the ‘merits’ of having an asynchronous object oriented IPC system) – but that can only take some meta-IPC (data-device), graphics buffers in only one direction and snaps like a twig under moderate load. Then you attach Pipewire for bidirectional audio and video, but that has no mechanism for synchronisation or interactive input, and other ‘meta’ aspects. Enter xdg-desktop-portal as a set of D-Bus interfaces.

Compared to any of the aforementioned, Binder is a work of art.

x64k 6 hours ago

Indeed, it seems to me that a lot of companies that jumped on the sandboxing bandwagon have missed a critical point that the sandboxing systems used in the mobile world didn’t. Sandboxing is a great idea but it’s not going to fly without a good way to talk to the world outside the sandbox. Without one, you either get a myriad of incompatible and extremely baroque solutions which so far are secure by obscurity and more often than not through ineffectiveness (Wayland), or a sandboxing system that everyone works around in order to keep it useful (Flatpak – an uncomfortable amount of real-life programs are run with full access to the home dir, rendering the sandbox practically useless as far as user applications are concerned).

Our industry has a history of not getting these second-order points. E.g. to this day I’m convinced that the primary reason why microkernels are mostly history is that everyone who jumped on the microkernel bandwagon in the 1990s poured a lot of thought into the message passing and ignored the part where the performance will always be shit if the message passing system and the scheduler aren’t integrated. QNX got it and is the only one that enjoyed some modicum of long-term success.

I’m not convinced we’re going to get the sandboxing part right, either. While I’m secretly hoping that a well-adapted version of KVM + QCow2 is going to win, what I think is more likely to happen is that the two prevailing operating systems today will just retrofit a sandboxing system from iOS, or a clone of it, and dump all “legacy apps” into a shared “legacy sandbox” that’s shielded from everything except a “Legacy Document” folder or whatever.

"

[52]

---

https://phoboslab.org/log/2021/11/qoi-fast-lossless-image-compression

sbrki 4 hours ago [–]

Very nice.

Regarding the comments about complexity of existing formats - I agree, with the exception of PNG. In terms of complexity, QOI is very similar to PNG's filtering phase, i.e. when you strip the primary compression phase (Huffman coding) from PNG.

Also, I admire how the author combined a lot of common and cool tricks into QOI:

Also, the fact that everything is byte aligned is very appreciated.

This is a breath of fresh air, and it's really impressive how good the performance (both processing and size reduction) is with so little (very readable!) C.

reply

dahart 1 hour ago [–]

This is pretty neat! The compression rates and encode/decode rates are impressive for such a simple rule set.

> SIMD acceleration for QOI would also be cool but (from my very limited knowledge about some SIMD instructions on ARM), the format doesn't seem to be well suited for it. Maybe someone with a bit more experience can shed some light?

I don’t know about SIMD encoding, but for decoding at least, the core problem is to make some unit of the image (pixels, scanlines, blocks, etc) independently addressable (readable). The current format is complicated due to pixels of different sizes; there’s no way to know where data for pixel 1000 starts without reading all preceding 999 pixels. You could build a prefix-sum array of pixel indices, but that will fully undermine your compression.

One question might be how well does QOI work on independent scan lines or blocks? Is the compression still pretty good if the look-back window is limited in size and restarts every once in a while?

To make a GPU capable format, the very first thing you would probably need to do is separate the tag bits from all the other data, so a thread could identify the pixel type by direct lookup into the tag buffer. It then needs to know where to go to find the remaining data, so you’d probably need at least one byte of data that could be used to reliably locate the rest of the data. I don’t see an easy way to make this happen for separate pixels, but it might not be bad if the image is encoded in blocks or scanlines.

reply
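the random-access problem dahart describes, made concrete with a toy run-length encoding (nothing QOI-specific): with variable-length records you can't know where row N starts without walking everything before it, so you either do a sequential pre-pass to build a row-offset index, or restart the encoder per scanline/block:

    #include <stdio.h>

    enum { W = 8, H = 4 };

    /* toy run-length stream: pairs of (run length, value). in this example
       every row happens to end on a record boundary, to keep the walker short. */
    static const unsigned char stream[] = {
        3, 10,  5, 20,          /* row 0: 3x10, 5x20 */
        8, 20,                  /* row 1 */
        2, 30,  6, 30,          /* row 2 */
        8, 40,                  /* row 3 */
    };

    int main(void)
    {
        size_t row_off[H];
        size_t off = 0;

        /* sequential pre-pass: the only way to learn where each row starts
           is to walk every record before it */
        for (int y = 0; y < H; y++) {
            row_off[y] = off;
            int px = 0;
            while (px < W) {
                px += stream[off];      /* pixels covered by this record */
                off += 2;               /* record size */
            }
        }

        for (int y = 0; y < H; y++)
            printf("row %d starts at byte %zu\n", y, row_off[y]);
        return 0;
    }

once such an index exists, rows (or blocks) can be decoded independently, which is the trade-off behind dahart's "restart the look-back window every once in a while" question.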

willvarfar 2 hours ago [–]

I believe (from quick code inspection) that the symmetry in encode/decode performance for QOI is because it has to generate the hash-table on decode too.

Normal fast encoders use some kind of hash table or search to find matches, but encode the offset of the match in the source rather than in the table. QOI is encoding the offset into the table, which gives much shorter offsets but means the decoder has to maintain the table too.

(The slow PPM compressors etc do maintain the same table in both encoder and decoder, and have symmetrical encode/decode performance too. See http://www.mattmahoney.net/dc/text.html)

It's not really in the spirit of QOI to add the complexity, but I'd imagine allowing the encoder to specify how big the hash table was would be only a small tweak, and encoding literal blocks instead of pixel-by-pixel would improve handling of input that QOI can't compress.
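what "the decoder has to maintain the table too" means concretely: QOI's index chunk only names a slot in a 64-entry array of recently seen pixels, hashed as in the spec (if i'm reading the reference code right), so encoder and decoder have to update that array in lockstep. a stripped-down sketch, not the full format:

    #include <stdint.h>

    struct px { uint8_t r, g, b, a; };

    /* the shared "recently seen pixels" table; both sides keep an identical copy */
    struct px seen[64];

    unsigned px_hash(struct px p)
    {
        /* hash as given in the QOI spec (to the best of my reading) */
        return (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) % 64;
    }

    /* encoder side: if the current pixel sits in its hash slot, a one-byte
       "index" chunk naming that slot is enough */
    int can_emit_index(struct px p, unsigned *slot)
    {
        unsigned h = px_hash(p);
        *slot = h;
        return seen[h].r == p.r && seen[h].g == p.g &&
               seen[h].b == p.b && seen[h].a == p.a;
    }

    /* decoder side: the mirror image of that lookup */
    struct px decode_index(unsigned slot)
    {
        return seen[slot];
    }

    /* both sides: after handling each pixel (whatever chunk was used),
       remember it -- this is the bookkeeping that makes decode do the same
       table-maintenance work as encode */
    void remember(struct px p)
    {
        seen[px_hash(p)] = p;
    }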

river 1 hour ago, on: Lossless Image Compression in O(n) Time

This is great! I would suggest one way to go even simpler: Use some raw file format for the images, and use a generic file compression afterwards. There isn’t really any reason why compression and image file format need to be coupled.

chuck_masta_grep 41 minutes ago

You may be interested in farbfeld from the Suckless folks, which is pretty much exactly what you are suggesting.

---

a new one; oxide's Hubris (Rust) https://oxide.computer/blog/hubris-and-humility

panick21_ 11 hours ago [–]

We are getting an increasing number of interesting Rust operating systems for different uses.

I assume there are more.

reply

im_down_w_otp 7 hours ago [–]

We also built a Rust framework called FerrOS? (https://github.com/auxoncorp/ferros) atop the formally-verified seL4 microkernel.

It looks like it has a similar set of usage idioms to Hubris, in terms of trying to set up as much as possible ahead of time to assemble what's kind of an application-specific operating system where everything your use case needs is assembled at build-time as a bunch of communicating tasks running on seL4.

We recently added a concise little persistence interface that pulls in TicKV? (https://docs.tockos.org/tickv/index.html) from the Tock project you referenced above, and some provisions are being added for some more dynamic task handling based on some asks from an automotive OEM.

Also, sincerest h/t to the Tock folks.

reply

---

TD-Linux 9 hours ago [–]

I have an embedded real-time control project that is currently written in Rust, but runs with RTIC (https://rtic.rs/), a framework which is conceptually similar (no dynamic allocation of tasks or resources) but also has some differences. RTIC is more of a framework for locks and critical sections in an interrupt based program than a full fledged RTOS. Looking through the docs, here's the main differences (for my purposes) I see:

1. In Hubris, all interrupt handlers dispatch to a software task. In RTIC, you can dispatch to a software task, but you can also run the code directly in the interrupt handler. RTIC is reliant on Cortex-M's NVIC for preemption, whereas Hubris can preempt in software (assuming it is implemented). This does increase the minimum effective interrupt latency in Hubris, and if not very carefully implemented, the jitter also.

2. Hubris compiles each task separately and then pastes the binaries together, presumably with a fancy linker script. RTIC can have everything in one source file and builds everything into one LTO'd blob. I see the Hubris method as mostly a downside (unless you want to integrate binary blobs, for example), but it might have been needed for:

3. Hubris supports Cortex-M memory protection regions. This is pretty neat and something that is mostly out of scope for RTIC (being built around primitives that allow shared memory, trying to map into the very limited number of MPU regions would be difficult at best). Of course, it's Rust, so in theory you wouldn't need the MPU protections, but if you have to run any sort of untrusted code this is definitely the winner.

Hubris does support shared memory via leases, but I'm not sure how it manages to map them into the very limited 8 Cortex-M MPU regions. I'm quite interested to look at the implementation when the source code is released.

Edit: I forgot to mention the biggest difference, which is that because tasks have separate stacks in Hubris, you can do blocking waits. RTIC may support async in the future but for now you must manually construct state machines.

reply

monocasa 8 hours ago [–]

> Hubris does support shared memory via leases, but I'm not sure how it manages to map them into the very limited 8 Cortex-M MPU regions.

What I did in a similar kernel was dynamically map them from a larger table on faults, sort of like you would with a soft fill TLB. When you turn off the MPU in supervisor mode you get a sane 'map everything' mapping, leaving all 8 entries to user code.

The way LDM/STM restart after faults is amenable to this model on the M series cores.

reply
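what monocasa describes (faulting regions in from a bigger software table, like a soft-filled TLB) might look roughly like this. the slot count matches the 8 Cortex-M MPU regions mentioned above, but the region struct, the task struct and mpu_program_slot are all made up; no real register pokes here:

    #include <stdint.h>
    #include <stddef.h>

    /* sketch: the kernel keeps the task's full list of memory grants in
       software; only HW_SLOTS of them are loaded into the MPU at a time.
       on a memory-management fault, look the address up in the big table,
       load it into a hardware slot (round-robin eviction), and let the
       faulting access restart. */
    enum { HW_SLOTS = 8 };

    struct region { uintptr_t base; size_t len; uint32_t attrs; };

    struct task_mem {
        const struct region *grants;   /* full software table of grants */
        size_t ngrants;
        struct region hw[HW_SLOTS];    /* what's currently programmed */
        unsigned next_victim;          /* round-robin eviction cursor */
    };

    /* stand-in for writing the actual MPU region registers */
    static void mpu_program_slot(unsigned slot, const struct region *r)
    {
        (void)slot; (void)r;           /* real code would program the MPU here */
    }

    /* returns 0 if the fault was a soft miss we could fill (retry the access),
       -1 if the address was never granted (genuine protection violation) */
    int mem_fault(struct task_mem *t, uintptr_t addr)
    {
        for (size_t i = 0; i < t->ngrants; i++) {
            const struct region *r = &t->grants[i];
            if (addr >= r->base && addr - r->base < r->len) {
                unsigned slot = t->next_victim++ % HW_SLOTS;
                t->hw[slot] = *r;
                mpu_program_slot(slot, r);
                return 0;
            }
        }
        return -1;
    }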

---

https://www.lynx.com/embedded-systems-learning-center/most-popular-real-time-operating-systems-rtos

---