notes-computer-systems-greenField


OS

"To be a viable computer system, one must honor a huge list of large, and often changing, standards: TCP/IP, HTTP, HTML, XML, CORBA, Unicode, POSIX, NFS, SMB, MIME, POP, IMAP, X, ... A huge amount of work, but if you don’t honor the standards you’re marginalized. Estimate that 90-95% of the work in Plan 9 was directly or indirectly to honor externally imposed standards." -- http://herpolhode.com/rob/utah2000.pdf (Rob Pike talk)


MS Singularity:


Plan 9 notes:

syscalls from http://man.cat-v.org/plan_9/2/intro :

bind open read dup fork seek stat remove fd2path getwd pipe exec chdir segattach exits wait sleep notify lock errstr

9P protocol messages from http://man.cat-v.org/plan_9/5/intro :

version auth error flush attach walk open create read write clunk remove stat wstat

version auth error flush clunk are protocol metastuff (flush aborts a transaction, clunk deallocates a file descriptor), leaving: attach walk open create read write remove stat wstat
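
To make that concrete, a hypothetical 9P session might look like the trace below (field values invented and the wire format abbreviated; see the intro(5) page above for the real layout):

```
Tversion msize=8192 version="9P2000"      Rversion msize=8192 version="9P2000"
Tattach fid=0 uname="glenda" aname=""     Rattach qid=(root dir)
Twalk fid=0 newfid=1 wname="motd"         Rwalk qid=(file)
Topen fid=1 mode=OREAD                    Ropen qid=(file) iounit=8192
Tread fid=1 offset=0 count=8192           Rread count=12 data="hello world\n"
Tclunk fid=1                              Rclunk
```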

ddevault on Jan 28, 2018

The kernel is simple and sanely designed and interfacing with it is done through the filesystem in, again, a very simple and sane way. Your displays are configured by a simple <10 line shell script at boot time that writes plaintext to a device file, rather than a gargantuan graphics stack that's controlled with binary device files and ioctls. Filesystem namespaces are lovely, and make the filesystem cleanly organized and customized to the logged in user's needs and desires. I have a little script that clears my current terminal window: `echo -n "" > /dev/text`. The libc is small and focused (non-POSIX), and there are only a handful of syscalls. The process model is really sane and straightforward as well. Playing an mp3 file is `mp3dec < file.mp3 > /dev/audio`.

To open a TCP connection, you use the dial(3) function, which basically does the following: write "tcp!name!port" to /net/cs and read out "1.2.3.4!80" (/net/cs resolves the name), then you write "connect 1.2.3.4" to /net/tcp/clone and read out a connection ID, and open /net/tcp/:id/data which is now a full-duplex TCP stream.

There's this emphasis on simple, sane ways of fulfilling tasks on plan9 that permeates the whole system. It's beautiful.
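
A rough C sketch of the dial() steps described above (illustrative only: POSIX-style calls, no error handling, and real code keeps the ctl fd open for the life of the connection; the actual /net/cs reply also includes the clone-file path, not just the address):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
dial_tcp(char *name, char *port)
{
    char req[128], addr[128], id[32], path[64];
    int cs, ctl, n;

    /* 1. Resolve: write "tcp!name!port" to /net/cs, read the address back. */
    cs = open("/net/cs", O_RDWR);
    n = snprintf(req, sizeof req, "tcp!%s!%s", name, port);
    write(cs, req, n);
    lseek(cs, 0, SEEK_SET);
    n = read(cs, addr, sizeof addr - 1);    /* e.g. "1.2.3.4!80" */
    addr[n] = '\0';
    close(cs);

    /* 2. Allocate a conversation: reading the clone file yields its id. */
    ctl = open("/net/tcp/clone", O_RDWR);
    n = read(ctl, id, sizeof id - 1);       /* e.g. "4" */
    id[n] = '\0';

    /* 3. Connect, then open the data file: a full-duplex TCP stream. */
    n = snprintf(req, sizeof req, "connect %s", addr);
    write(ctl, req, n);
    snprintf(path, sizeof path, "/net/tcp/%s/data", id);
    return open(path, O_RDWR);
}
```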

https://thedorkweb.substack.com/p/a-week-with-plan-9


https://lwn.net/SubscriberLink/718267/206c8a5fbf0ee2ea/ https://news.ycombinator.com/item?id=14002386

" Fuchsia: a new operating system

Nur Hussein

Fuchsia is a new operating system being built more or less from scratch at Google. ... At the heart of Fuchsia is the Magenta microkernel... LK, the kernel that Magenta builds upon, was created by Fuchsia developer Travis Geiselbrecht before he joined Google. LK's goal is to be a small kernel that runs on resource-constrained tiny embedded systems (in the same vein as FreeRTOS or ThreadX). Magenta, on the other hand, targets more sophisticated hardware (a 64-bit CPU with a memory-management unit is required to run it), and thus expands upon LK's limited features. Magenta uses LK's "inner constructs", which is comprised of threads, mutexes, timers, events (signals), wait queues, semaphores, and a virtual memory manager (VMM). For Magenta, LK's VMM has been substantially improved upon.

One of the key design features of Magenta is the use of capabilities....Capabilities are implemented in Magenta by the use of constructs called handles....Almost all system calls require that a handle be passed to them. Handles have rights associated with them... The rights that can be granted to a handle are for reading or writing to the associated kernel object or, in the case of a virtual memory object, whether or not it can be mapped as executable....Since memory is treated as a resource that is accessed via kernel objects, processes gain use of memory via handles. Creating a process in Fuchsia means a creator process (such as a shell) must do the work of creating virtual memory objects manually for the child process. This is different from traditional Unix-like kernels such as Linux, where the kernel does the bulk of the virtual memory setup for processes automatically. Magenta's virtual memory objects can map memory in any number of ways, and a lot of flexibility is given to processes to do so. One can even imagine a scenario where memory isn't mapped at all, but can still be read or written to via its handle like a file descriptor. While this setup allows for all kinds of creative uses, it also means that a lot of the scaffolding work for processes to run must be done by the user-space environment.

Since Magenta was designed as a microkernel, most of the operating system's major functional components also run as user-space processes. These include the drivers, network stack, and filesystems. The network stack was originally bootstrapped from lwIP, but eventually it was replaced by a custom network stack written by the Fuchsia team. The network stack is an application that sits between the user-space network drivers and the application that requests network services. A BSD socket API is provided by the network stack.

The default Fuchsia filesystem, called minfs, was also built from scratch... since the filesystems run as user-space servers, accessing them is done via a protocol to those servers....The user-space C libraries make the protocol transparent to user programs, which will just make calls to open, close, read, and write files. ... Full POSIX compatibility is not a goal for the Fuchsia project; enough POSIX compatibility is provided via the C library, which is a port of the musl project to Fuchsia..."
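
The "memory read or written via its handle like a file descriptor" case above is concrete in the VMO syscalls. A minimal sketch, using current Zircon call names (zx_*; in the Magenta era the prefix was mx_*) — treat the signatures as illustrative:

```c
#include <zircon/syscalls.h>

int vmo_demo(void) {
    zx_handle_t vmo;
    char buf[5];
    /* create a one-page anonymous virtual memory object */
    if (zx_vmo_create(4096, 0, &vmo) != ZX_OK)
        return -1;
    zx_vmo_write(vmo, "hello", 0, 5);   /* store through the handle, no mapping */
    zx_vmo_read(vmo, buf, 0, 5);        /* fetch through the handle */
    zx_handle_close(vmo);               /* drop the capability */
    return 0;
}
```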

https://fuchsia.googlesource.com/docs/+/master/the-book/

Khaine 13 hours ago

One thing I don't see addressed in the README is why? Why do we need Fuchsia? What problem are we trying to solve? Why should I use/develop for it instead of Windows/Linux/macOS?

Or is this just a research operating system designed to test new ideas out?


akavel 12 hours ago

The main keywords are "capability based" and "microkernel". Those ideas bring numerous advantages over monolithic kernels (including Linux, Windows, macOS), especially humongous boost to protection against vulnerabilities, also better reliability and modularity. They are quite well researched already AFAIU, and apparently the time has come for them to start breaking through to "mainstream" (besides Fuchsia, see e.g. https://genode.org, https://redox-os.org)

Other than that, for Google this would obviously bring total control over the codebase, allowing them to do whatever they want, and super quickly, not needing to convince Linus or anybody else.


naasking 21 hours ago

Some problems I see from skimming the docs:

> Calls which have no limitations, of which there are only a very few, for example zx_clock_get() and zx_nanosleep() may be called by any thread.

Having the clock be an ambient authority leaves the system open to easy timing attacks via implicit covert channels. I'm glad these kinds of timing attacks have gotten more attention with Spectre and Meltdown. Capability security folks have been pointing these out for decades.

> Calls which create new Objects but do not take a Handle, such as zx_event_create() and zx_channel_create(). Access to these (and limitations upon them) is controlled by the Job in which the calling Process is contained.

I'm hesitant to endorse any system calls with ambient authority, even if it's scoped by context like these. It's far too easy to introduce subtle vulnerabilities. For instance, these calls seem to permit a Confused Deputy attack as long as two processes are running in the same Job.

Other notes on the kernel:

Looks like they'll also support private namespacing ala Plan 9, which is great. I hope we can get a robust OS to replace existing antiquated systems with Google's resources. This looks like a good start.



overview from the Lisp Machine manual http://lispm.de/genera-concepts

---

Networking

Associated VMs


L4

" In this spirit, the L4 microkernel provides few basic mechanisms: address spaces (abstracting page tables and providing memory protection), threads and scheduling (abstracting execution and providing temporal protection), and inter-process communication (for controlled communication across isolation boundaries).

An operating system based on a microkernel like L4 provides services as servers in user space that monolithic kernels like Linux or older generation microkernels include internally. For example, in order to implement a secure Unix-like system, servers must provide the rights management that Mach included inside the kernel. " [1]

" Microkernels minimize the functionality that is provided by the kernel: The kernel provides a set of general mechanisms, while user-mode servers implement the actual operating system (OS) services [Brinch Hansen 1970; Levin et al. 1975]. Application code obtains a system service by communicating with servers via an interprocess com- munication (IPC) mechanism, typically message passing. Hence, IPC is on the critical path of any service invocation, and low IPC costs are essential.

By the early 1990s, IPC performance had become the Achilles heel of microkernels: The typical cost for a one-way message was around 100µs, which was too high for building performant systems. This resulted in a trend to move core services back into the kernel [Condict et al. 1994]. ... Twenty years ago, Liedtke [1993a] demonstrated with his L4 kernel that microkernel IPC could be fast, a factor 10–20 faster than contemporary microkernels. " [2]

" The asynchronous in-kernel-buffering process communication concept used in Mach turned out to be one of the main reasons for its poor performance. ... Detailed analysis of the Mach bottleneck indicated that, among other things, its working set is too large: the IPC code expresses poor spatial locality; that is, it results in too many cache misses, of which most are in-kernel.[3] This analysis gave rise to the principle that an efficient microkernel should be small enough that the majority of performance-critical code fits into the (first-level) cache (preferably a small fraction of said cache).

...

Instead of Mach's complex IPC system, (Jochen Liedtke's) L3 microkernel simply passed the message without any additional overhead.

...

After some experience using L3, Liedtke came to the conclusion that several other Mach concepts were also misplaced. By simplifying the microkernel concepts even further he developed the first L4 kernel which was primarily designed with high performance in mind. In order to wring out every bit of performance the entire kernel was written in assembly language, and its IPC was 20 times faster than Mach's. ... " https://en.wikipedia.org/wiki/L4_microkernel_family

"...all of them have a scheduler in the kernel, which implements a particular scheduling policy (usually hard-priority round robin)." [3]

L4 IPC

"We mentioned earlier the importance of IPC performance and that the design and im- plementation of L4 kernels consistently aimed at maximising it. However, the details have evolved considerably.

3.2.1. Synchronous IPC. The original L4 supported synchronous (rendezvous-style) IPC (((the sender blocks))) as the only...mechanism. Synchronous IPC avoids buffering in the kernel and the management and copying cost associated with it. In fact, in its simplest version (short messages passed in registers) it is nothing but a context switch that leaves the message registers untouched...typical L4 implementations have IPC costs that are only 10% to 20% above the hardware limit (defined as the cost of two mode switches, a switch of page tables, plus saving and restoring addressing context and user-visible processor state)...

While certainly minimal, and simple conceptually and in implementation, experience taught us significant drawbacks of this model: It forces a multithreaded design onto otherwise simple systems, with the resulting synchronisation complexities. For example, the lack of functionality similar to UNIX select() required separate threads per interrupt source, and a single-threaded server could not wait for client requests and interrupts at the same time.

Furthermore, synchronous message passing is clearly the wrong way of synchronising activities across processor cores. On a single processor, communication between threads requires that (eventually) a context switch happens, and combining the context switch with communication minimises overheads. Consequently, the classical L4 IPC model is that of a user-controlled context switch that bypasses the scheduler; some payload is delivered through nonswitched registers, and further optional payload by kernel copy. On hardware that supports true parallelism, an RPC-like server invocation sequentialises client and server, which should be avoided if they are running on separate cores... " [4]
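
A minimal sketch of that model in seL4 terms (assuming an endpoint capability `ep` already exists; the fast path for a message this short is essentially a user-controlled context switch):

```c
#include <sel4/sel4.h>

/* RPC-style invocation: seL4_Call = send + atomic switch to receive. */
seL4_Word rpc_add(seL4_CPtr ep, seL4_Word a, seL4_Word b) {
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 2); /* 2 MRs */
    seL4_SetMR(0, a);             /* payload in (virtual) message registers */
    seL4_SetMR(1, b);
    info = seL4_Call(ep, info);   /* rendezvous: block until the reply */
    return seL4_GetMR(0);         /* reply payload, also in MRs */
}
```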

notifications:

" We addressed this in L4-embedded by adding notifications , a simple, nonblocking signalling mechanism. We later refined this model in seL4’s notification objects

A notification contains a set of flags, the notification word, which is essentially an array of binary semaphores. A signal operation on a notification object sets a subset of the flags without blocking. The notification word can be checked by polling or by waiting (blocking) for a signal—effectively select() across the notification word.

Our present design provides another feature aimed at reducing the need for multithreaded code, unifying waiting for IPC and notifications. For example, a file server might have an IPC interface for client requests, as well as a notification used by the disk driver to signal I/O completion. The unification feature binds a notification object to a thread (the server of the above example). If a notification is signalled while the thread is waiting for a message, the notification is converted into a single-word message and delivered to the thread (with an indication that it is really a notification). Notifications are not an introduction of asynchronous IPC through the backdoor but rather a (partial) decoupling of synchronisation from communication. While strictly not minimal (in that they add no functionality that could not be emulated with other mechanisms), they are essential for exploiting concurrency of the hardware.

In summary, like most other L4 kernels, seL4 retains the model of synchronous IPC but augments it with semaphore-like notifications. OKL4 has completely abandoned synchronous IPC and replaced it with virtual IRQs (similar to notifications). NOVA has augmented synchronous IPC with counting semaphores [Steinberg and Kauer 2010], while Fiasco.OC has also augmented synchronous IPC with virtual IRQs. " [5]
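
A sketch of the binding pattern from the file-server example above (capabilities `ep`, `ntfn`, and `server_tcb` assumed to exist; call names per recent seL4):

```c
#include <sel4/sel4.h>

/* Driver side: signal I/O completion. Sets flags in the notification
   word and never blocks. */
void driver_io_done(seL4_CPtr ntfn) {
    seL4_Signal(ntfn);
}

/* Server side: bind the notification to the server thread, then wait
   on the endpoint; a signal arriving mid-wait is delivered as a
   single-word message, distinguishable from client IPC. */
void server_loop(seL4_CPtr ep, seL4_CPtr ntfn, seL4_CPtr server_tcb) {
    seL4_TCB_BindNotification(server_tcb, ntfn);
    for (;;) {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(ep, &badge);
        (void)info;  /* dispatch on badge: client request vs. notification */
    }
}
```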

virtual registers:

early L4 versions passed messages in CPU registers. But "Pistachio introduced the concept of virtual message registers (originally 64 and later a configuration option). The implementation mapped some of them to physical registers, and the rest was contained in a per-thread pinned part of the address space. The pinning ensures register-like access without the possibility of a page fault. Inlined access functions hide the distinction between physical and memory-backed registers from the user. seL4 and Fiasco.OC continue to use this approach. The motivation is two-fold: virtual message registers greatly improve portability across architectures. Furthermore, they reduce the performance penalty for moderately sized messages exceeding the number of physical registers " [6]

long messages (dropped):

" In original L4, “long” messages could specify multiple buffers in a single IPC invo- cation to amortise the hardware mode- and context-switch costs...required the kernel to handle nested exceptions...in practice it was rarely used: Shared buffers can avoid any explicit copying between address spaces and are generally preferred for bulk data transfer..." [7]. The need for nested exception handling (page fault exceptions, actually) in the kernel would have made verification of seL4 much harder.

IPC Destinations (dropped):

" Original L4 had threads as the targets of IPC operations ... Influenced by EROS [Shapiro et al. 1999], seL4 and Fiasco.OC [Lackorzynski and Warg 2009]) adopted IPC endpoints as IPC destinations. seL4 endpoints are essentially ports: The root of the queue of pending senders or receivers is a now a separate kernel object, instead of being part of the recipient’s thread control block (TCB). Unlike Mach ports [Accetta et al. 1986], IPC endpoints do not provide any buffering. ... In order to help servers identify clients without requiring per-client endpoints, seL4 provides badged capabilities , similar to the distinguished capabilities of KeyKOS? [Bromberger et al. 1992]. Capabilities with different badges but derived from the same original capability refer to the same (endpoint) object but on invokation deliver to the receiver the badge as an identification of the sender. "

IPC timeouts (dropped):

" A blocking IPC mechanism creates opportunities for denial-of- service (DOS) attacks. For example, a malicious (or buggy) client could send a request to a server without ever attempting to collect the reply; owing to the rendezvous-style IPC, the sender would block indefinitely unless it implements a watchdog to abort and restart...

To protect against such attacks, IPC operations in the original L4 had timeouts. Specifically, an IPC syscall specified four timeouts: one to limit blocking until start of the send phase, one to limit blocking in the receive phase, and two more to limit blocking on page faults during the send and receive phases (of long IPC).

Timeout values were encoded in a floating-point format that supported the values of zero, infinity, and finite values ranging from one millisecond to weeks. They added complexity for managing wakeup lists.

Practically, however, timeouts were of little use as a DOS defence. There is no theory, or even good heuristics, for choosing timeout values in a nontrivial system, and in practice, only the values zero and infinity were used: A client sends and receives with infinite timeouts, while a server waits for a request with an infinite but replies with a zero timeout. (Footnote: The client uses an RPC-style call operation, consisting of a send followed by an atomic switch to a receive phase, guaranteeing that the client is ready to receive the server’s reply.)

Traditional watchdog timers represent a better approach to detecting unresponsive IPC interactions (e.g., resulting from deadlocks).

Having abandoned long IPC, in L4-embedded we replaced timeouts by a single flag supporting a choice of polling (zero timeout) or blocking (infinite timeout). Only two flags are needed, one for the send and one for the receive phase. seL4 follows this model. A fully asynchronous model, such as that of OKL4, is incompatible with timeouts and has no DOS issues that would require them.

Timeouts could also be used for timed sleeps by waiting on a message from a non-existing thread, a feature useful in real-time systems. Dresden experimented with extensions, including absolute timeouts, which expire at a particular wall clock time rather than relative to the commencement of the system call. Our approach is to give userland access to a (physical or virtual) timer "
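
In seL4 the zero/infinity choice surfaces as blocking vs. nonblocking call variants rather than a timeout argument (a sketch; exactly which variants apply to endpoints vs. notification objects varies by kernel version):

```c
#include <sel4/sel4.h>

void variants(seL4_CPtr ep, seL4_MessageInfo_t info) {
    seL4_Word badge;
    seL4_Send(ep, info);    /* block until a receiver arrives ("infinite") */
    seL4_NBSend(ep, info);  /* deliver only if one is already waiting ("zero") */
    seL4_Recv(ep, &badge);  /* block for a message ("infinite") */
    seL4_Poll(ep, &badge);  /* check and return immediately ("zero") */
}
```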

L4 implementations

"He unveiled the 12-Kbyte fast L4 microkernel after 1996 while working for IBM in New York City. ... The L4 developer community is very active and a modern implementation of L4 in C++ is the L4Ka::Pistachio microkernel " [8] (2005)

"Karlsruhe’s experience with Version X and Hazelnut resulted in a major ABI revision, V4, aimed at improving kernel and application portability and multiprocessor support and addressing various other shortcomings. After Liedtke’s tragic death in 2001, his students implemented the design in a new open-source kernel, L4Ka::Pistachio (“Pistachio” for short)." [9]

"At NICTA we then retargeted Pistachio for use in resource-constrained embedded systems, resulting in a fork called NICTA::Pistachio-embedded (“L4-embedded”). It saw massive-scale commercial deployment when Qualcomm adopted it as a protected-mode real-time OS for the firmware of their wireless modem processors. It is now running on the security processor of all recent Apple iOS devices" [10]

" The influence of KeyKOS? [Hardy 1985] and EROS [Shapiro et al. 1999] and an increased focus on security resulted in the adoption of capabilities [Dennis and Van Horn 1966] for access control, first with the 2.1 release of OKL4 (2008) and soon followed by Fiasco; Fiasco was renamed Fiasco.OC in reference to its use of object capabilities. Aiming for formal verification, which seemed infeasible for a code base not designed for the purpose, we instead opted for a from-scratch implementation for our capability-based seL4 kernel. " [11]

"seL4’s SLOC count is somewhat bloated as a consequence of the C code being mostly a “blind” manual translation from Haskell [Klein et al. 2009], together with generated bit-field accessor functions, resulting in hundreds of small functions. The kernel compiles into about 9 k ARM instructions" [12]

" Later the UNSW group, at their new home at NICTA, forked L4Ka::Pistachio into a new L4 version called NICTA::L4-embedded. As the name implies, this was aimed at use in commercial embedded systems, and consequently the implementation trade-offs favored small memory footprints and aimed to reduce complexity. The API was modified to keep almost all system calls short enough that they do not need preemption points to ensure high real-time responsiveness.[10] " [13]

" On 29 July 2014, NICTA and General Dynamics C4 Systems announced that seL4, with end to end proofs, was now released under open source licenses.[20] The kernel source and proofs are under GPLv2, and most libraries and tools are under the 2-clause BSD license. " [14]

" F9 microkernel, a BSD-licensed L4 implementation, is built from scratch for deeply embedded devices with performance on ARM Cortex-M3/M4 processors, power consumption, and memory protection in mind. " [15]

F9 has about 3k LOC [16]

"the most general principles behind L4, minimality, including running device drivers at user level, generality, and a strong focus on performance, still remain relevant and foremost in the minds of developers. Specifically we find that the key microkernel per- formance metric, IPC latency, has remained essentially unchanged, in terms of clock cycles, as far as comparisons across vastly different ISAs and micro architectures have any validity. This is in stark contrast to the trend identified by Ousterhout [1990] just a few years before L4 was created. Furthermore, and maybe most surprisingly, the code size has essentially remained constant, a rather unusual development in software sys- tems" [17]

[18]