notes-computer-programming-programmingLanguageDesign-aNoteOnDistributedComputing

notes on A Note on Distributed Computing by Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall

" A direct result of this work was the invention (by Ann Wollrath) of the Remote Method Invocation system (RMI) that has become a standard part of the JavaTM? platform. The paper also formed the basis of a design philosophy that led to the JiniTM? Networking technology, " -- (a forward to the paper A Note on Distributed Computing)

" objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a dif- ferent model of memory access, and take into account issues of concurrency and partial failure. "

" The hard problems in distributed computing are not the problems of how to get things on and off the wire. The hard problems in distributed computing concern dealing with partial failure and the lack of a central resource man- ager. The hard problems in distributed computing concern insuring adequate performance and dealing with problems of concurrency. The hard problems have to do with differ- ences in memory access paradigms between local and dis- tributed entities. People attempting to write distributed applications quickly discover that they are spending all of their efforts in these areas and not on the communications protocol programming interface. "

" The major differences between local and distributed com- puting concern the areas of latency, memory access, partial failure, and concurrency."

" The most obvious difference between a local object invo- cation and the invocation of an operation on a remote (or possibly remote) object has to do with the latency of the two calls. The difference between the two is currently between four and five orders of magnitude, "

" Ignoring the difference between the performance of local and remote invocations can lead to designs whose imple- mentations are virtually assured of having performance problems because the design requires a large amount of communication between components that are in different address spaces and on different machines. "

" A more fundamental (but still obvious) difference between local and remote computing concerns the access to mem- ory in the two cases—specifically in the use of pointers. Simply put, pointers in a local address space are not valid in another (remote) address space. The system can paper over this difference, but for such an approach to be suc- cessful, the transparency must be complete. Two choices exist: either all memory access must be controlled by the underlying system, or the programmer must be aware of the different types of access—local and remote. There is no inbetween. "

" If the desire is to completely unify the programming model—to make remote accesses behave as if they were in fact local—the underlying mechanism must totally control all memory access. "

"...in distributed computing, where one component (machine, network link) can fail while the oth- ers continue. Not only is the failure of the distributed com- ponents independent, but there is no common agent that is able to determine what component has failed and inform the other components of that failure, no global state that can be examined that allows determination of exactly what error has occurred. "

" This is not the case in distributed computing, where one component (machine, network link) can fail while the oth- ers continue. Not only is the failure of the distributed com- ponents independent, but there is no common agent that is able to determine what component has failed and inform the other components of that failure, no global state that can be examined that allows determination of exactly what error has occurred. ... These sorts of failures are not the same as mere exception raising or the inability to complete a task, which can occur in the case of local computing. This type of failure is caused when a machine crashes during the execution of an object invocation or a network link goes down, occur- rences that cause the target object to simply disappear rather than return control to the caller. A central problem in distributed computing is insuring that the state of the whole system is consistent after such a failure; this is a problem that simply does not occur in local computing. "

"Partial failure requires that programs deal with indeterminacy. "

" If an object is coresident in an address space with its caller, partial failure is not possible. A function may not complete normally, but it always completes. "

" The question is not “can you make remote method invocation look like local method invoca- tion?” but rather “what is the price of making remote method invocation identical to local method invocation?” ... The first path is to treat all objects as if they were local and design all interfaces as if the objects calling them, and being called by them, were local. ... The other path is to design all interfaces as if they were remote... this introduces unnecessary guarantees and semantics for objects that are never intended to be used remotely.... This approach would also defeat the overall purpose of unifying the object models. The real reason for attempting such a unification is to make distributed computing more like local computing and thus make distributed computing easier. "

" One might argue that a multi-threaded application needs to deal with these same issues. However, there is a subtle dif- ference. In a multi-threaded application, there is no real source of indeterminacy of invocations of operations. The application programmer has complete control over invoca- tion order when desired. A distributed system by its nature introduces truly asynchronous operation invocations. Further, a non-distributed system, even when multi-threaded, is layered on top of a single operating system that can aid the communication between objects and can be used to determine and aid in synchronization and in the recovery of failure. A distributed system, on the other hand, has no single point of resource allocation, synchronization, or failure recovery, and thus is conceptually very different. "

" Let us imagine that I build an application that uses the (mythical) queue interface to enqueue work for some com- ponent. My application dutifully enqueues records that represent work to be done. Another application dutifully dequeues them and performs the work. After a while, I notice that my application crashes due to time-outs. I find this extremely annoying, but realize that it’s my fault. My application just isn’t robust enough. It gives up too easily on a time-out. So I change my application to retry the operation until it succeeds. Now I’m happy. I almost never see a time-out. Unfortunately, I now have another prob- lem. Some of the requests seem to get processed two, three, four, or more times. ... since the entities being enqueued are just values, there is no way to do duplicate elimination. The only way to fix this is to change the protocol to add request IDs. "

" Similar situations can be found throughout the standard set of interfaces. Suppose I want to reliably remove a name from a context. ... I keep trying the operation until it succeeds (or until I crash). The problem is that my connection to the name server may have gone down, but another client’s may have stayed up. I may have, in fact, successfully removed the name but not discovered it because of a net- work disconnection. The other client then adds the same name, which I then remove. Unless the naming interface includes an operation to lock a naming context, there is no way that I can make this operation completely robust. "

" In the design of any opera-tion, the question has to be asked: what happens if the cli- ent chooses to repeat this operation with the exact same parameters as previously? What mechanisms are needed to ensure that they get the desired semantics? These are things that can be expressed only at the interface level. "

" Similar arguments can be made about performance. Sup- pose an interface describes an object which maintains sets of other objects. A defining property of sets is that there are no duplicates. Thus, the implementation of this object needs to do duplicate elimination. If the interfaces in the system do not provide a way of testing equality of refer- ence, the objects in the set must be queried to determine equality. ...The overall performance of eliminating dupli- cates is going to be governed by the latency in communi- cating over the slowest communications link involved. "

" Lessons from NFS ... NFS®, Sun’s distributed computing file system [14], [15] is an example of a non- distributed application programer interface (API) (open, read, write, close, etc.) re-implemented in a distributed way. ... NFS opened the door to partial failure within a file system. It has essentially two modes for dealing with an inaccessi- ble file server: soft mounting and hard mounting. But since the designers of NFS were unwilling (for easily under- standable reasons) to change the interface to the file sys- tem to reflect the new, distributed nature of file access, neither option is particularly robust. Soft mounts expose network or server failure to the client program. Read and write operations return a failure status much more often than in the single-system case, and pro- grams written with no allowance for these failures can eas- ily corrupt the files used by the program. ... Today, soft mounts are sel- dom used, and when they are used, their use is generally restricted to read-only file systems or special applications. Hard mounts mean that the application hangs until the server comes back up. This generally prevents a client pro- gram from seeing partial failure, but it leads to a malady familiar to users of workstation networks: one server crashes, and many workstations—even those apparently having nothing to do with that server—freeze. "

"The reliability of NFS cannot be changed without a change to " the traditional file system interface

" The way NFS has dealt with partial failure has been to infor- mally require a centralized resource manager (a system administrator) who can detect system failure, initiate resource reclamation and insure system consistency. But by introducing this central resource manager, one could argue that NFS is no longer a genuinely distributed appli- cation. "