An Outline of “A Note on Distributed Computing”
September 4, 2013
0. Abstract
- Objs in distributed system must be treated differently from objs in single address space.
- Four reasons: latency, memory access, concurrency, partial failure
- Systems that paper over distinctions not robust or reliable
I. Introduction
- Much work based on assumption that local and remote objs form single ontological class
- Thesis: above view mistaken.
- Work that ignores differences doomed to failure.
- Terminology
  - local computing: progs confined to single addr space
  - distributed computing: progs that call to other addr spaces
  - middle ground: e.g. calls between addr spaces on same machine
II. The Vision of Unified Objects
- Obj's interface defined in interface definition language; implementation (incl. its location) independent of interface
- “Objects all the way down”
- Three phases of dev
  - Write app without worrying about object location
  - Tune performance by concretizing object locs and communication
  - Final phase: test with “real bullets” (e.g. partitioned networks, machines going down), then add components to deal with the “fault points” that appear
- Mistaken assumptions we will disprove
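  - single natural obj-oriented design exists for app regardless of context in which it is deployed
  - failure and performance issues tied to implementation of components, so can be left out of initial design
  - interface of obj independent of context in which obj is used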
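The unified-object vision can be sketched as one interface with interchangeable local and remote implementations. This is a minimal sketch, not the paper's code: `Counter`, `LocalCounter`, `RemoteCounterProxy` and `client` are hypothetical names, and the "network" is simulated by a `network_up` flag rather than any real RPC library. Client code is identical for both implementations, which is the appeal; but only the remote one can fail, and nothing in the interface says so.

```python
from abc import ABC, abstractmethod


class Counter(ABC):
    """Hypothetical interface, written once 'regardless of context'."""

    @abstractmethod
    def increment(self) -> int: ...


class LocalCounter(Counter):
    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value


class RemoteCounterProxy(Counter):
    """Stand-in for a remote stub; the network is simulated by a flag."""

    def __init__(self, target: LocalCounter) -> None:
        self.target = target
        self.network_up = True

    def increment(self) -> int:
        if not self.network_up:
            # A failure mode the Counter interface never mentions.
            raise ConnectionError("remote counter unreachable")
        return self.target.increment()


def client(counter: Counter) -> int:
    # Written against the interface alone, as in the first phase of dev.
    return counter.increment()
```

Usage: `client(LocalCounter())` returns 1. With a `RemoteCounterProxy` whose `network_up` is set to `False`, the same `client` call raises a `ConnectionError` it was never written to handle.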
III. Deja Vu All Over Again
- Two paths for comm protocol dev: integrate w/ current lang model vs solve inherent distributed systems problems
- Hard part is not the communication; it is partial failure and the lack of a central resource manager
IV. Local and Distributed Computing
- Latency
  - 4 to 5 orders of magnitude difference between local and remote invocation
  - False answers: rely on steadily increasing hardware speed, or build tools to see which objs interact heavily and then move those objs close together
  - Could strive for future in which local and remote calls are conceptually indistinguishable in latency
- Memory Access
  - Pointers not valid across address spaces
  - Could require all refs to be obj refs (no raw pointers), but would violate local programming paradigm
  - Danger: promoting myth that remote and local access are the same without enforcing it
- Partial failure and concurrency
  - Both worlds contain components subject to periodic failure, but in local computing failure is either total or detectable by a central resource allocator (e.g. the OS).
  - E.g. failure of network link indistinguishable from failure of processor on other side.
  - In distributed computing, very difficult to restore system to consistent state after failure because no single manager of state.
  - Often impossible to know whether failure occurred before or after an op completed, so component interfaces must be able to state possible causes of failure.
  - In local system, no indeterminacy about how much of a computation completed.
  - What is the price to make remote & local identical?
    - Path 1: Treat all as if local. Result is nondeterministic in face of partial failure and therefore fragile.
    - Path 2: Design all interfaces as if remote, but this adds unnecessary complexity to local computing and defeats purpose of unified model.
  - Distributed objects must handle concurrent method invocations.
  - Differences from multi-threaded app:
    - In multi-threaded app, no real source of indeterminacy (programmer can control invocation order)
    - Distributed has no single pt of resource allocation (threaded has OS)
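The partial-failure indeterminacy described above can be made concrete with a simulated remote call. This is a minimal sketch under stated assumptions: `Server` is an in-process stand-in for a remote object, and `drop` is a hypothetical fault-injection parameter that plays the role of a lossy network (no real RPC is involved). Whether the request or the reply is lost, the caller sees the same `RemoteTimeout`, yet the server's state differs.

```python
from typing import Optional


class RemoteTimeout(Exception):
    """What the caller sees when no reply arrives in time."""


class Server:
    """In-process stand-in for a remote object on another machine."""

    def __init__(self) -> None:
        self.balance = 0

    def deposit(self, amount: int) -> int:
        self.balance += amount
        return self.balance


def call_deposit(server: Server, amount: int, drop: Optional[str] = None) -> int:
    """Simulated remote call. `drop` is a hypothetical fault-injection knob:
    None (no fault), "request" (request lost) or "reply" (reply lost)."""
    if drop == "request":
        # The server never sees the request, so nothing happens.
        raise RemoteTimeout("no reply within deadline")
    result = server.deposit(amount)
    if drop == "reply":
        # The deposit happened, but the caller never learns that.
        raise RemoteTimeout("no reply within deadline")
    return result
```

Usage: calling `call_deposit(a, 100, drop="request")` and `call_deposit(b, 100, drop="reply")` on two fresh servers raises the same exception both times, but afterwards `a.balance == 0` and `b.balance == 100`. A local method call has no equivalent of the second case.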
V. The Myth of "Quality of Service"
- Could leave it to implementation of object to deal with above issues.
- I.e. to build a more reliable system, choose more reliable implementations of interfaces making up system.
- Robustness also depends on interaction between components (e.g. same work enqueued multiple times when a client retries after a failure).
- What happens if the client chooses to repeat this operation with the exact same params as previously?
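One common answer to the repeated-operation question is to make the operation idempotent with a client-chosen request ID. This technique is swapped in for illustration and is not prescribed by the paper; `NaiveQueue`, `DedupQueue` and `new_request_id` are hypothetical names. The sketch shows that robustness comes from how client and server cooperate, not from a "more reliable" queue implementation alone.

```python
import uuid


class NaiveQueue:
    def __init__(self) -> None:
        self.items: list[str] = []

    def enqueue(self, item: str) -> None:
        self.items.append(item)


class DedupQueue:
    """Remembers request IDs so a retried enqueue is applied at most once."""

    def __init__(self) -> None:
        self.items: list[str] = []
        self.seen: set[str] = set()

    def enqueue(self, request_id: str, item: str) -> None:
        if request_id in self.seen:
            return  # duplicate retry: already applied, ignore
        self.seen.add(request_id)
        self.items.append(item)


def new_request_id() -> str:
    # The client picks one ID per logical operation and reuses it on retries.
    return str(uuid.uuid4())
```

Usage: a client whose first reply was lost retries `q.enqueue(rid, "job-1")` with the same `rid`; `DedupQueue` holds one item, while `NaiveQueue` ends up with two. Note that the `seen` set grows without bound here; a real design would need a policy for discarding old IDs, which is itself a resource-reclamation problem.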
VI. Lessons from NFS
- Non-distributed API (local Unix file system) re-implemented in distributed way
- To deal with inaccessible file server: soft mount or hard mount
  - Soft mounts expose network failure to client program.
    - Ops fail much more often, and programs w/ no allowance for these failures will corrupt files.
  - Hard mount: hang until server comes back up.
    - One server crashes and many workstations freeze.
- NFS statelessness reduces complexity of failure modes.
- NFS is successful, but depends on a centralized resource manager (a sysadmin) to deal with resource reclamation, etc.
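The soft/hard trade-off above can be sketched as two retry policies. This is a simplified simulation, not NFS client code: `FlakyServer`, `ServerNotResponding`, `soft_read` and `hard_read` are hypothetical names, and the only NFS behavior assumed is that a soft mount returns an error after a bounded number of retries while a hard mount keeps retrying until the server answers.

```python
from typing import Optional


class ServerNotResponding(Exception):
    """Raised by the simulated server while it is down."""


class FlakyServer:
    """Simulated file server that ignores its first `down_for` requests."""

    def __init__(self, down_for: int) -> None:
        self.down_for = down_for
        self.calls = 0

    def read(self) -> bytes:
        self.calls += 1
        if self.calls <= self.down_for:
            raise ServerNotResponding("server not responding")
        return b"file contents"


def soft_read(server: FlakyServer, retries: int) -> bytes:
    """Soft-mount-like policy: after `retries` failed attempts, hand an error to the program."""
    last_error: Optional[Exception] = None
    for _ in range(retries):
        try:
            return server.read()
        except ServerNotResponding as err:
            last_error = err
    # The program must now cope with an I/O error it may never have expected.
    raise OSError("I/O error: file server unreachable") from last_error


def hard_read(server: FlakyServer) -> bytes:
    """Hard-mount-like policy: keep retrying until the server answers."""
    while True:
        try:
            return server.read()
        except ServerNotResponding:
            # A real client would wait between retries; the caller just hangs meanwhile.
            continue
```

Usage: with a server that is down for 5 requests, `soft_read(server, retries=3)` raises `OSError` after 3 attempts, while `hard_read` blocks until the 6th attempt succeeds. Neither policy removes the failure; each only chooses who has to deal with it.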
VII. Taking the Difference Seriously
- Do not try to merge local and remote objs. Instead, keep their differences constantly in mind.
- Need to consider anticipated msg frequency for obj and whether clients can accept indeterminacy implied by remote access.
- Interface must allow for reliability in face of partial failure.
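One way for an interface to "allow for reliability in face of partial failure" is to put the possible outcomes of a remote call into its return type, so callers cannot ignore them. A minimal sketch under that assumption: `Outcome`, `RemoteResult` and `next_step` are hypothetical names, and the three-way split is one illustrative choice, not something the paper prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class Outcome(Enum):
    OK = "ok"
    NOT_DELIVERED = "not delivered"  # e.g. connection refused before anything was sent
    UNKNOWN = "unknown"              # e.g. timeout after sending: may or may not have run


@dataclass
class RemoteResult(Generic[T]):
    """Return type of every call in a hypothetical remote interface."""
    outcome: Outcome
    value: Optional[T] = None


def next_step(result: RemoteResult[int]) -> str:
    """Caller-side handling that the return type forces into view."""
    if result.outcome is Outcome.OK:
        return f"done: {result.value}"
    if result.outcome is Outcome.NOT_DELIVERED:
        return "safe to retry"
    # Indeterminate: the operation may already have taken effect.
    return "reconcile before retrying"
```

Design note: Java RMI applies the same idea at the language level, since methods of an interface extending `java.rmi.Remote` must declare `java.rmi.RemoteException`, so callers of remote methods must acknowledge remote failure.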
VIII. A Middle Ground
- “Local-remote” objects do exist. Objects are in different address spaces, but managed by a single resource manager.
- Failure modes are more nearly deterministic.
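Because local-remote objects share a single resource manager, a failure on the other side is reported rather than inferred. A minimal sketch, assuming the OS plays that role for two processes on one machine; `run_child` is a hypothetical helper built on the standard `subprocess` module. The parent learns reliably that the child exited and with what status, which a caller across a network cannot do when a reply simply never arrives.

```python
import subprocess
import sys


def run_child(code: str) -> int:
    """Run `code` in a separate Python process (a separate address space)
    and return its exit status as reported by the OS."""
    completed = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return completed.returncode
```

Usage: `run_child("import sys; sys.exit(3)")` returns 3, and a child that dies with an uncaught exception returns 1. Either way the parent is told by the OS what happened; it never has to guess.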