Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Building computer systems that are reliable is hard. The functional programming community has invested a lot of time and energy into up-front-correctness guarantees: types and the like. Unfortunately, absolutely correct software is time-consuming to write and expensive as a result. Fault-tolerant systems achieve system-total reliability by accepting that sub-components will fail and planning for that failure as a first-class concern of the system. As companies embrace the wave of "as-a-service" architectures, the failure of sub-systems becomes a more pressing concern. Using examples from heavy industry, aeronautics, and telecom systems, this talk explores how you can design for fault-tolerance and how functional programming techniques get us most of the way there.

Published in: Software


  1. Fault-Tolerance on the Cheap: Making systems that probably won’t fall over.
  2. Hi, folks!
  3. I do things to/with computers.
  4. I’m a real-time, networked systems engineer.
  5. Real-Time Systems
  6. Real-Time Systems • Computation on a deadline
  7. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational
  8. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort
  9. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort • Resource adequate / inadequate
  10. Networked Systems
  11. Networked Systems • Out-of-order messages
  12. Networked Systems • Out-of-order messages • No legitimate concept of “now”
  13. Networked Systems • Out-of-order messages • No legitimate concept of “now” • High-latency transmission
  14. Networked Systems • Out-of-order messages • No legitimate concept of “now” • High-latency transmission • Lossy transmission
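The networked-systems list above can be made concrete. Out-of-order arrival is routinely handled with per-sender sequence numbers and a reordering buffer; the sketch below is illustrative (the class name and message shape are assumptions, not from the deck):

```python
import heapq

class ReorderBuffer:
    """Re-establishes sender order from sequence numbers, even when the
    network delivers messages out of order."""

    def __init__(self):
        self._next_seq = 0   # next sequence number owed to the application
        self._pending = []   # min-heap of (seq, payload) that arrived early

    def receive(self, seq, payload):
        """Accept one (possibly out-of-order) message; return everything
        that is now deliverable in order."""
        heapq.heappush(self._pending, (seq, payload))
        deliverable = []
        while self._pending and self._pending[0][0] == self._next_seq:
            _, p = heapq.heappop(self._pending)
            deliverable.append(p)
            self._next_seq += 1
        return deliverable

buf = ReorderBuffer()
buf.receive(1, "b")   # arrives early: held back, returns []
buf.receive(0, "a")   # fills the gap, returns ["a", "b"]
```

Lossy transmission would add retransmission on top of this; the buffer only solves the ordering half of the problem.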
  15. Punk Rock Version
  16. Here’s a socket.
  17. Here’s an interrupt.
  18. Go program a computer!
  19. AdRoll
  20. real-time bidding
  21. Erlang
  22. Fault Tolerance
  23. Sub-components fail. The system does not.
  24. Well, not right away…
  25. What’s it take?
  26. Option 1: Perfection
  27. Option 1: Perfection • Total control over the whole mechanism.
  28. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain.
  29. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain. • Specific, explicit system goals.
  30. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain. • Specific, explicit system goals. • Well-known service lifetime.
  31. Option 1: Perfection “They Write the Right Stuff”, Fast Company, 2005
  32. Option 1: Perfection • Extremely expensive.
  33. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity.
  34. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity. • Design up front.
  35. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity. • Design up front. • Complete control of the system is not complete.
  36. Option 2: Hope for the Best
  37. Option 2: Hope for the Best • Little up-front knowledge of the problem domain.
  38. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals.
  39. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals. • No money down.
  40. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals. • No money down. • Ingenuity under pressure.
  41. Option 2: Hope for the Best “Move fast and break things!”
  42. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues.
  43. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users.
  44. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users. • No, money down!
  45. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users. • No, money down! • Hard to change cultural values.
  46. Option 3: Embrace Faults
  47. Option 3: Embrace Faults • Partial control over the whole mechanism.
  48. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain.
  49. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain. • Sorta explicit system goals.
  50. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain. • Sorta explicit system goals. • Able to spot a failure when you see one.
  51. Option 3: Embrace Faults “Fail fast. Either do the right thing or stop.” “Why Do Computers Stop and What Can Be Done About It?”, Jim Gray, 1985 (paraphrase)
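Gray’s fail-fast discipline has a simple shape in code: check your invariants at the boundary and stop immediately when one is violated, rather than limping along with corrupt state. A minimal sketch (the `state`/`bid` shapes are illustrative assumptions, not from the talk):

```python
def apply_bid(state, bid):
    """Fail-fast update: either produce a valid new state or stop.

    Invariants are checked up front, so a violated assumption halts us
    close to the fault instead of surfacing as corruption much later.
    """
    assert bid["amount"] > 0, "bid amount must be positive"
    assert bid["auction_id"] not in state, "duplicate auction"
    new_state = dict(state)                 # old state is left untouched
    new_state[bid["auction_id"]] = bid["amount"]
    return new_state

apply_bid({}, {"auction_id": "a1", "amount": 250})   # ok: {"a1": 250}
```

The point is the contract, not the assertion mechanism: in production you would raise a typed error and let a supervisor decide what to do with the stopped component.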
  52. Option 3: Embrace Faults • Faults are isolated but must be resolved in production.
  53. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection.
  54. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection. • Moderate design up-front.
  55. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection. • Moderate design up-front. • Pay a little now, pay a little later.
  56. Let’s talk embracing faults.
  57. There are four conceptual stages to consider.
  58. Component
  59. The most atomic level of the system.
  60. Progress here has an outsized impact.
  61. Immutable Data Structures
  62. Isolate Side-Effects
  63. Compile-Time Guarantees
  64. Why test, when you can prove?
  65. This is Functional Programming
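The component-level advice — immutable data, isolated side-effects — can be sketched briefly. This toy (an `Account` type invented for illustration, not from the talk) keeps the core logic pure and pushes the single effectful call to the edge:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)          # immutable: attempts to mutate raise
class Account:
    owner: str
    balance: int

def deposit(acct, amount):
    """Pure core: returns a *new* Account; the old one is never mutated."""
    return replace(acct, balance=acct.balance + amount)

def run(acct, amount, emit=print):
    """Effectful edge: the only side-effect lives here, injected via `emit`."""
    new = deposit(acct, amount)
    emit(f"{new.owner}: {new.balance}")
    return new

a0 = Account("alice", 100)
a1 = deposit(a0, 50)    # a1.balance == 150; a0 is unchanged
```

Because `deposit` cannot destroy history, a fault in one computation cannot silently corrupt the state another component is holding — which is exactly why progress at this level has an outsized impact.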
  66. Machine
  67. Faults in components are exercised here.
  68. Faults in interactions are exercised here.
  69. Supervise, and restart.
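“Supervise, and restart” is the Erlang/OTP supervisor idea; a drastically simplified Python sketch of the control flow (not the talk’s actual Erlang, and real supervisors track restart intensity over a time window rather than a flat count):

```python
import time

def supervise(worker, max_restarts=3):
    """Run `worker`; on crash, restart it up to `max_restarts` times.
    If the restart budget is exhausted, re-raise: the fault escalates
    to *our* supervisor rather than looping forever."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise                  # fail upward, don't mask the fault
            time.sleep(0)              # real supervisors back off here

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient fault")
    return "ok"

supervise(flaky)   # → "ok" after two restarts
```

The restart clears the transient fault by returning the worker to a known-good initial state; persistent faults exhaust the budget and propagate, which is the “distinguish your critical components” decision in miniature.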
  70. Use only addressable names.
  71. Distinguish your critical components.
  72. Cluster
  73. Redundant Components
  74. No Single Points of Failure
  75. Mean Time to Failure Estimates
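The cluster-level slides reduce to back-of-envelope arithmetic. With *independent* failures, redundancy multiplies the probability that all replicas are down at once (the numbers below are illustrative, not from the talk):

```python
# If one replica is available fraction `a` of the time and failures are
# independent, all n replicas are down simultaneously with probability
# (1 - a) ** n, so the redundant system is available the rest of the time.
def system_availability(a, n):
    return 1 - (1 - a) ** n

system_availability(0.99, 1)   # 0.99      (one replica)
system_availability(0.99, 2)   # ≈ 0.9999  (two replicas)
system_availability(0.99, 3)   # ≈ 0.999999
```

The independence assumption is the catch: a shared power feed, deploy pipeline, or load balancer correlates the failures and quietly reintroduces a single point of failure, which is why the estimate must be checked against instrumentation rather than trusted.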
  76. Instrument and Monitor
  77. Organization
  78. A finely built machine without a supporting organization is a disaster waiting to happen.
  79. A finely built machine without organization is a disaster waiting to happen. Chernobyl, STS-51-L, Deepwater Horizon, Magnitogorsk, Damascus Incident, Chevron Refinery, BART ATC, Asiana #214, Therac-25, New Orleans Levee
  80. Correct the conditions that allowed mistakes, as well as the mistake.
  81. Process is Priceless
  82. Build flexible tools for experts.
  83. Separate Your Concerns
  84. Build with Failure in mind.
  85. Have resources you’re willing to sacrifice.
  86. Study accidents.
  87. Every system carries the potential for its own destruction.
  88. Some things aren’t worth building.
  89. Understand Networks.
  90. 0. The network is unreliable. 1. Latency is non-zero. 2. Bandwidth is finite. 3. The network is insecure. 4. Topology changes. 5. There are many administrators. 6. Transport cost is non-zero. 7. The network is heterogeneous.
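Items 0 and 1 of that list (the inversions of Deutsch’s fallacies) translate directly into defensive client code: bounded retries with jittered backoff. A sketch under assumed names — `call_with_retry` and `ConnectionError` as the failure signal are illustrative choices, not from the talk:

```python
import random
import time

def call_with_retry(op, attempts=3, base_delay=0.05):
    """Call `op`, retrying a bounded number of times on network failure.
    Jittered exponential backoff avoids synchronized retry storms; when
    the budget is exhausted we re-raise — fail fast, don't retry forever."""
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i) * random.random())

calls = {"n": 0}
def lossy_send():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("packet lost")
    return "delivered"

call_with_retry(lossy_send)   # → "delivered" on the second attempt
```

Note that retries only make sense for idempotent operations; the reordering and duplication this introduces is the price item 0 exacts.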
  91. Thanks so much! @bltroutwine
  92. Recommended Reading: “Normal Accidents: Living with High-Risk Technologies”, Charles Perrow; “Digital Apollo: Human and Machine in Spaceflight”, David A. Mindell; “Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety”, Eric Schlosser; “Erlang Programming”, Simon Thompson and Francesco Cesarini; “Steeltown, USSR”, Stephen Kotkin; “Crash-Only Software”, George Candea and Armando Fox; “The Truth About Chernobyl”, Grigorii Medvedev; “Real-Time Systems: Design Principles for Distributed Embedded Applications”, Hermann Kopetz; “The Apollo Guidance Computer: Architecture and Operation”, Frank O’Brien; “Why Do Computers Stop and What Can Be Done About It?”, Jim Gray; “Thirteen: The Apollo Flight That Failed”, Henry S.F. Cooper Jr.