Unix in the Cloud
Ignorance, Stagnation, Obsolescence
Synopsis
- cloud in the broad sense of ideology
- not quite about running BSD on EC2
- limited to the humble skills and experience of yours truly
Multi-core
- installation?
- configuration management?
- load balancing?
Multi-node
- installation?
- configuration management?
- load balancing?
- why multi-node?
Large Computing Needs
- Facebook, Google, ...
- more than any OS can provide
Happy Hardware Vendor Law
The number of nodes needed to solve a given task doubles every now and again.
OS Scalability Limit
- 1 node only
- multi-socket and stacks approaching NUMA
- E25K, z10, etc. fail for most purposes
Operating System — ?
- traditional definition no longer relevant
- the notion itself on the brink of obsolescence
- field heavily eroded by current distributed apps
Distributed Applications
- forced to be an OS unto themselves
- huge overlap
- huge opportunity for sharing and consolidation
Anti-Patterns
- virtualization
- chefs and puppets
- thick abstraction
Attempts
- z/OS
- Plan 9, Inferno
- Clustrx, E1, DYSEAC, ...
- OpenStack (~~)
Species Survival Plan
Freeze the bodies and leave them for future generations to fix.
Don't Panic: Incremental
- perfection v. done
- still a decade or more till a good AI
- no practical need for POSIX over a cloud
Mindful Approach
- immediate practicality
- long-term perspective
- sustained, integrally rich effect
Operating System
- major abstraction repository
- overlapping code distillery
- pre-production architecture research
Increments
Machine Generated Data
- logs, error messages, status monitors
- once meant for humans, no longer
- rethinking for better aggregation and analysis
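One way to read the last bullet: emit each event as a self-describing record instead of a free-form line, so that aggregation reduces to counting fields. A minimal sketch, assuming JSON lines as the format; the field names (`node`, `unit`, `event`) and both helper names are illustrative and not from the talk.

```python
import json
from collections import Counter

def emit(node, unit, event, **fields):
    """Serialize one event as a single JSON line (hypothetical record layout)."""
    record = {"node": node, "unit": unit, "event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)

def aggregate(lines, key):
    """Count records by one field; records lacking the field are skipped."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if key in record:
            counts[record[key]] += 1
    return counts

# Three events from two nodes, then a per-event tally.
lines = [
    emit("n1", "sshd", "login_failed", user="root"),
    emit("n2", "sshd", "login_failed", user="admin"),
    emit("n2", "ahci0", "disk_error", lba=1024),
]
by_event = aggregate(lines, "event")
```

The same `aggregate` call works for any field (`"node"`, `"unit"`), which is the property free-form log lines lack.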
Identity and Authentication
- YP, LDAP outdated and poorly supported
- no distributed model
- passwd in git as a first stab
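A "passwd in git" first stab would presumably want some sanity check before a change is accepted, for instance from a commit hook. The sketch below validates text in the classic seven-field passwd layout (name, password, UID, GID, GECOS, home, shell). The function name and the particular checks are assumptions for illustration; duplicate UIDs are deliberately allowed, since some systems ship aliases such as `toor`.

```python
def check_passwd(text):
    """Return a list of problems found in passwd-format text (empty if none)."""
    problems = []
    names = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) != 7:
            problems.append(f"line {lineno}: expected 7 fields, got {len(fields)}")
            continue
        name, _password, uid, gid = fields[:4]
        if not (uid.isdigit() and gid.isdigit()):
            problems.append(f"line {lineno}: non-numeric UID or GID")
            continue
        if name in names:
            problems.append(f"line {lineno}: duplicate name {name}")
        names.add(name)
    return problems

sample = (
    "root:*:0:0:Charlie Root:/root:/bin/sh\n"
    "alice:*:1001:1001:Alice:/home/alice:/bin/sh\n"
    "alice:*:1002:1002:Alice II:/home/alice2:/bin/sh\n"
    "broken:*:abc\n"
)
problems = check_passwd(sample)
```

A hook would reject the commit whenever the returned list is non-empty.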
Remote Procedure Call
- ssh losing relevance, HPN or not
- almighty agent daemon worse than rsh
- capabilities, RBAC, WoT
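One possible reading of "capabilities" here is a signed token that names exactly what its bearer may do, so the callee can check the request itself instead of trusting an almighty agent. A minimal sketch using HMAC-SHA256 from the Python standard library; the token layout, the `mint`/`allows` names and the `restart:httpd` action string are assumptions, not a protocol from the talk.

```python
import hashlib
import hmac

def mint(secret, subject, action, expires):
    """Issue a token letting `subject` perform `action` until time `expires`.

    Assumes subject and action never contain the '|' separator.
    """
    payload = f"{subject}|{action}|{expires}"
    tag = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"

def allows(secret, token, action, now):
    """Return True only if the signature, the action and the expiry all check out."""
    try:
        subject, granted, expires, tag = token.split("|")
        deadline = int(expires)
    except ValueError:
        return False  # malformed token
    payload = f"{subject}|{granted}|{expires}"
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    signed = hmac.compare_digest(tag.encode(), expected.encode())
    return signed and granted == action and now < deadline

secret = b"shared-between-issuer-and-node"
token = mint(secret, "alice", "restart:httpd", expires=100)
```

The token grants one action for a bounded time, in contrast to a daemon that will run anything for anyone who can reach it.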
Hardware Failures
- no culture for low-level fault-tolerance
- watchdogd as state-of-the-art self-healing
- focus on self-diagnostics: disk error counters, etc.
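The self-diagnostics bullet can be illustrated by comparing two snapshots of cumulative error counters and flagging devices whose counters grew. This is only a sketch: how the counters are collected is out of scope, and the device and counter names (`da0`, `read_errors`) are hypothetical.

```python
def degraded(previous, current, tolerance=0):
    """Return devices whose error counters grew by more than `tolerance`.

    Both arguments map device name -> {counter name -> cumulative count}.
    A device or counter absent from `previous` is treated as starting at 0.
    """
    report = {}
    for device, counters in current.items():
        before = previous.get(device, {})
        grown = {
            name: value - before.get(name, 0)
            for name, value in counters.items()
            if value - before.get(name, 0) > tolerance
        }
        if grown:
            report[device] = grown
    return report

previous = {"da0": {"read_errors": 0, "write_errors": 0},
            "da1": {"read_errors": 3, "write_errors": 0}}
current = {"da0": {"read_errors": 0, "write_errors": 0},
           "da1": {"read_errors": 9, "write_errors": 0}}
report = degraded(previous, current)
```

A node could use a non-empty report to take itself out of rotation before the disk fails outright.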
Distributed Configuration
- current anti-patterns worsen the problem
- role-aware configuration
- / in git as a second stab
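"Role-aware configuration" can be sketched as layering: a common base that every node gets, overridden by whatever roles the node holds. The layer names, keys and the `resolve` helper below are assumptions made up for the example.

```python
def resolve(layers, roles):
    """Merge the common layer with each role's layer; later roles win on conflict."""
    config = dict(layers.get("common", {}))
    for role in roles:
        config.update(layers.get(role, {}))
    return config

layers = {
    "common": {"ntp": "on", "sshd": "on", "httpd": "off"},
    "web": {"httpd": "on"},
    "storage": {"nfsd": "on"},
}
node_config = resolve(layers, ["web", "storage"])
```

With the layers kept as files in a repository, a node only needs its list of roles to derive its full configuration.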
Storage
- intra-node redundancy irrelevant
- no appropriate local multi-disk FS
- no fast path for data exchange
- nginx + curl + dispatcher
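The talk does not say what the dispatcher does. One plausible reading is a function that maps an object key to the node holding it, after which a client fetches the object over HTTP (curl against nginx). The sketch below uses rendezvous (highest-random-weight) hashing for that mapping; the technique, the node names and the `/data/` path are all assumptions.

```python
import hashlib

def pick_node(key, nodes):
    """Rendezvous hashing: every client computes the same node for a given key."""
    def weight(node):
        digest = hashlib.sha256(f"{node}/{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=weight)

def url_for(key, nodes):
    """URL a client would fetch, assuming nginx serves /data/ on each node."""
    return f"http://{pick_node(key, nodes)}/data/{key}"

nodes = ["node1.example", "node2.example", "node3.example"]
chosen = pick_node("blob-42", nodes)
```

The appeal of this scheme is that removing a node only remaps the keys that lived on it, so no central lookup table is needed.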
Error Handling
- cf. Machine Generated Data and Hardware Failures
- software is 10x more failure-prone than hardware
- serious problem at scale
☺