Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Building computer systems that are reliable is hard. The functional programming community has invested a lot of time and energy into up-front-correctness guarantees: types and the like. Unfortunately, absolutely correct software is time-consuming to write and expensive as a result. Fault-tolerant systems achieve system-total reliability by accepting that sub-components will fail and planning for that failure as a first-class concern of the system. As companies embrace the wave of "as-a-service" architectures, the failure of sub-systems becomes a more pressing concern. Using examples from heavy industry, aeronautics, and telecom systems, this talk explores how you can design for fault-tolerance and how functional programming techniques get us most of the way there.

Published in: Software


  1. Fault-Tolerance on the Cheap: Making systems that probably won’t fall over.
  2. Hi, folks!
  3. I do things to/with computers.
  4. I’m a real-time, networked systems engineer.
  5. Real-Time Systems
  6. Real-Time Systems • Computation on a deadline
  7. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational
  8. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort
  9. Real-Time Systems • Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort • Resource adequate / inadequate
  10. Networked Systems
  11. Networked Systems • Out-of-order messages
  12. Networked Systems • Out-of-order messages • No legitimate concept of “now”
  13. Networked Systems • Out-of-order messages • No legitimate concept of “now” • High-latency transmission
  14. Networked Systems • Out-of-order messages • No legitimate concept of “now” • High-latency transmission • Lossy transmission
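The networked-systems list above can be made concrete. Out-of-order arrival is routinely handled with per-sender sequence numbers and a reordering buffer; the sketch below is illustrative (the class name and message shape are assumptions, not from the deck):

```python
import heapq

class ReorderBuffer:
    """Re-establishes sender order from sequence numbers, even when the
    network delivers messages out of order."""

    def __init__(self):
        self._next_seq = 0   # next sequence number owed to the application
        self._pending = []   # min-heap of (seq, payload) that arrived early

    def receive(self, seq, payload):
        """Accept one (possibly out-of-order) message; return everything
        that is now deliverable in order."""
        heapq.heappush(self._pending, (seq, payload))
        deliverable = []
        while self._pending and self._pending[0][0] == self._next_seq:
            _, p = heapq.heappop(self._pending)
            deliverable.append(p)
            self._next_seq += 1
        return deliverable

buf = ReorderBuffer()
buf.receive(1, "b")   # arrives early: held back, returns []
buf.receive(0, "a")   # fills the gap, returns ["a", "b"]
```

Lossy transmission would add retransmission on top of this; the buffer only solves the ordering half of the problem.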
  15. Punk Rock Version
  16. Here’s a socket.
  17. Here’s an interrupt.
  18. Go program a computer!
  19. AdRoll
  20. real-time bidding
  21. Erlang
  22. Fault Tolerance
  23. Sub-components fail. The system does not.
  24. Well, not right away…
  25. What’s it take?
  26. Option 1: Perfection
  27. Option 1: Perfection • Total control over the whole mechanism.
  28. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain.
  29. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain. • Specific, explicit system goals.
  30. Option 1: Perfection • Total control over the whole mechanism. • Total understanding of the problem domain. • Specific, explicit system goals. • Well-known service lifetime.
  31. Option 1: Perfection “They Write the Right Stuff”, Fast Company, 2005
  32. Option 1: Perfection • Extremely expensive.
  33. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity.
  34. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity. • Design up front.
  35. Option 1: Perfection • Extremely expensive. • Intentionally stifles creativity. • Design up front. • Complete control of the system is not complete.
  36. Option 2: Hope for the Best
  37. Option 2: Hope for the Best • Little up-front knowledge of the problem domain.
  38. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals.
  39. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals. • No money down.
  40. Option 2: Hope for the Best • Little up-front knowledge of the problem domain. • Implicit or short-term system goals. • No money down. • Ingenuity under pressure.
  41. Option 2: Hope for the Best “Move fast and break things!”
  42. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues.
  43. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users.
  44. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users. • No, money down!
  45. Option 2: Hope for the Best • Ignorance of the problem domain leads to long-term system issues. • Failures do propagate out toward users. • No, money down! • Hard to change cultural values.
  46. Option 3: Embrace Faults
  47. Option 3: Embrace Faults • Partial control over the whole mechanism.
  48. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain.
  49. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain. • Sorta explicit system goals.
  50. Option 3: Embrace Faults • Partial control over the whole mechanism. • Partial understanding of the problem domain. • Sorta explicit system goals. • Able to spot a failure when you see one.
  51. Option 3: Embrace Faults “Fail fast. Either do the right thing or stop.” “Why Do Computers Stop and What Can Be Done About It?”, Jim Gray, 1985 (paraphrase)
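Gray’s fail-fast discipline has a simple shape in code: check your invariants at the boundary and stop immediately when one is violated, rather than limping along with corrupt state. A minimal sketch (the `state`/`bid` shapes are illustrative assumptions, not from the talk):

```python
def apply_bid(state, bid):
    """Fail-fast update: either produce a valid new state or stop.

    Invariants are checked up front, so a violated assumption halts us
    close to the fault instead of surfacing as corruption much later.
    """
    assert bid["amount"] > 0, "bid amount must be positive"
    assert bid["auction_id"] not in state, "duplicate auction"
    new_state = dict(state)                 # old state is left untouched
    new_state[bid["auction_id"]] = bid["amount"]
    return new_state

apply_bid({}, {"auction_id": "a1", "amount": 250})   # ok: {"a1": 250}
```

The point is the contract, not the assertion mechanism: in production you would raise a typed error and let a supervisor decide what to do with the stopped component.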
  52. Option 3: Embrace Faults • Faults are isolated but must be resolved in production.
  53. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection.
  54. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection. • Moderate design up-front.
  55. Option 3: Embrace Faults • Faults are isolated but must be resolved in production. • Must carefully design for introspection. • Moderate design up-front. • Pay a little now, pay a little later.
  56. Let’s talk embracing faults.
  57. There are four conceptual stages to consider.
  58. Component
  59. The most atomic level of the system.
  60. Progress here has an outsized impact.
  61. Immutable Data Structures
  62. Isolate Side-Effects
  63. Compile-Time Guarantees
  64. Why test, when you can prove?
  65. This is Functional Programming
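The component-level advice — immutable data, isolated side-effects — can be sketched briefly. This toy (an `Account` type invented for illustration, not from the talk) keeps the core logic pure and pushes the single effectful call to the edge:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)          # immutable: attempts to mutate raise
class Account:
    owner: str
    balance: int

def deposit(acct, amount):
    """Pure core: returns a *new* Account; the old one is never mutated."""
    return replace(acct, balance=acct.balance + amount)

def run(acct, amount, emit=print):
    """Effectful edge: the only side-effect lives here, injected via `emit`."""
    new = deposit(acct, amount)
    emit(f"{new.owner}: {new.balance}")
    return new

a0 = Account("alice", 100)
a1 = deposit(a0, 50)    # a1.balance == 150; a0 is unchanged
```

Because `deposit` cannot destroy history, a fault in one computation cannot silently corrupt the state another component is holding — which is exactly why progress at this level has an outsized impact.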
  66. Machine
  67. Faults in components are exercised here.
  68. Faults in interactions are exercised here.
  69. Supervise, and restart.
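“Supervise, and restart” is the Erlang/OTP supervisor idea; a drastically simplified Python sketch of the control flow (not the talk’s actual Erlang, and real supervisors track restart intensity over a time window rather than a flat count):

```python
import time

def supervise(worker, max_restarts=3):
    """Run `worker`; on crash, restart it up to `max_restarts` times.
    If the restart budget is exhausted, re-raise: the fault escalates
    to *our* supervisor rather than looping forever."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise                  # fail upward, don't mask the fault
            time.sleep(0)              # real supervisors back off here

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient fault")
    return "ok"

supervise(flaky)   # → "ok" after two restarts
```

The restart clears the transient fault by returning the worker to a known-good initial state; persistent faults exhaust the budget and propagate, which is the “distinguish your critical components” decision in miniature.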
  70. Use only addressable names.
  71. Distinguish your critical components.
  72. Cluster
  73. Redundant Components
  74. No Single Points of Failure
  75. Mean Time to Failure Estimates
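The cluster-level slides reduce to back-of-envelope arithmetic. With *independent* failures, redundancy multiplies the probability that all replicas are down at once (the numbers below are illustrative, not from the talk):

```python
# If one replica is available fraction `a` of the time and failures are
# independent, all n replicas are down simultaneously with probability
# (1 - a) ** n, so the redundant system is available the rest of the time.
def system_availability(a, n):
    return 1 - (1 - a) ** n

system_availability(0.99, 1)   # 0.99      (one replica)
system_availability(0.99, 2)   # ≈ 0.9999  (two replicas)
system_availability(0.99, 3)   # ≈ 0.999999
```

The independence assumption is the catch: a shared power feed, deploy pipeline, or load balancer correlates the failures and quietly reintroduces a single point of failure, which is why the estimate must be checked against instrumentation rather than trusted.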
  76. Instrument and Monitor
  77. Organization
  78. A finely built machine without a supporting organization is a disaster waiting to happen.
  79. A finely built machine without organization is a disaster waiting to happen. Chernobyl, STS-51-L, Deepwater Horizon, Magnitogorsk, Damascus Incident, Chevron Refinery, BART ATC, Asiana #214, Therac-25, New Orleans Levee
  80. Correct the conditions that allowed mistakes, as well as the mistake.
  81. Process is Priceless
  82. Build flexible tools for experts.
  83. Separate Your Concerns
  84. Build with Failure in mind.
  85. Have resources you’re willing to sacrifice.
  86. Study accidents.
  87. Every system carries the potential for its own destruction.
  88. Some things aren’t worth building.
  89. Understand Networks.
  90. 0. The network is unreliable. 1. Latency is non-zero. 2. Bandwidth is finite. 3. The network is insecure. 4. Topology changes. 5. There are many administrators. 6. Transport cost is non-zero. 7. The network is heterogeneous.
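Items 0 and 1 of that list (the inversions of Deutsch’s fallacies) translate directly into defensive client code: bounded retries with jittered backoff. A sketch under assumed names — `call_with_retry` and `ConnectionError` as the failure signal are illustrative choices, not from the talk:

```python
import random
import time

def call_with_retry(op, attempts=3, base_delay=0.05):
    """Call `op`, retrying a bounded number of times on network failure.
    Jittered exponential backoff avoids synchronized retry storms; when
    the budget is exhausted we re-raise — fail fast, don't retry forever."""
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i) * random.random())

calls = {"n": 0}
def lossy_send():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("packet lost")
    return "delivered"

call_with_retry(lossy_send)   # → "delivered" on the second attempt
```

Note that retries only make sense for idempotent operations; the reordering and duplication this introduces is the price item 0 exacts.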
  91. Thanks so much! @bltroutwine
  92. Recommended Reading: “Normal Accidents: Living with High-Risk Technologies”, Charles Perrow; “Digital Apollo: Human and Machine in Spaceflight”, David A. Mindell; “Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety”, Eric Schlosser; “Erlang Programming”, Simon Thompson and Francesco Cesarini; “Steeltown, USSR”, Stephen Kotkin; “Crash-Only Software”, George Candea and Armando Fox; “The Truth About Chernobyl”, Grigorii Medvedev; “Real-Time Systems: Design Principles for Distributed Embedded Applications”, Hermann Kopetz; “The Apollo Guidance Computer: Architecture and Operation”, Frank O’Brien; “Why Do Computers Stop and What Can Be Done About It?”, Jim Gray; “Thirteen: The Apollo Flight That Failed”, Henry S.F. Cooper Jr.