OTP Supervisors

Beamtalk actors are OTP gen_servers. When an actor crashes, you want the system to restart it automatically — that's what supervisors are for.

A supervisor watches a set of actor processes. When a child crashes, the supervisor applies a restart strategy: restart just that child, or restart all of them, depending on the policy.

This chapter covers:

  1. Supervisor subclass: — static child list, known at start-up
  2. Restart strategies and policies
  3. DynamicSupervisor subclass: — add children at runtime
  4. Supervision in practice: a resilient counter service

Static supervisors

A static supervisor knows its children at start-up time. Subclass Supervisor and override class children to return the list of actor classes to supervise:

TestCase subclass: Ch17StaticSupervisor

  testSupervisorClass =>
    // Supervisor subclass: defines a supervision tree.
    // class children => returns the list of actors to supervise.
    self assert: CounterApp strategy equals: #oneForOne
    self assert: CounterApp children size equals: 1

  testSupervisorIsSupervisorFlag =>
    self assert: CounterApp isSupervisor
    self deny: Counter isSupervisor

The CounterApp supervisor is defined like this (not run as a doctest — it requires an OTP application environment):

Actor subclass: Counter
  state: value = 0

  increment => self.value := self.value + 1
  value => self.value

Supervisor subclass: CounterApp
  class children => #[Counter supervisionSpec]

class children returns an array of SupervisionSpec values. The simplest spec is SomeActorClass supervisionSpec, which uses the actor's default restart policy (#temporary).

Supervision specs

A SupervisionSpec describes how to start one supervised child. It is built from an actor class using fluent setter methods:

TestCase subclass: Ch17SupervisionSpecs

  testDefaultSpecHasTemporaryRestart =>
    spec := Counter supervisionSpec
    self assert: spec restart equals: #temporary

  testCustomRestartPolicy =>
    spec := Counter supervisionSpec withRestart: #permanent
    self assert: spec restart equals: #permanent

  testCustomId =>
    spec := Counter supervisionSpec withId: #mainCounter
    self assert: spec id equals: #mainCounter

  testChainedBuilders =>
    spec := Counter supervisionSpec
      withId: #primary
      withRestart: #permanent
    self assert: spec id equals: #primary
    self assert: spec restart equals: #permanent

Actor restart policies

Each actor class can declare its own default restart policy. Override class supervisionPolicy to change it:

TestCase subclass: Ch17RestartPolicy

  testDefaultPolicyIsTemporary =>
    // Actors default to #temporary — not restarted on crash
    self assert: Counter supervisionPolicy equals: #temporary

  testSpecInheritsActorPolicy =>
    // supervisionSpec picks up the actor's policy automatically
    spec := Counter supervisionSpec
    self assert: spec restart equals: #temporary

Restart policies:

Policy        Meaning
#temporary    Never restarted (default)
#transient    Restarted on abnormal termination only
#permanent    Always restarted

To make a worker always restart, override class supervisionPolicy in the actor class (not shown as a doctest — overriding class methods requires a class definition):

Actor subclass: PersistentWorker
  class supervisionPolicy => #permanent
  state: value = 0
  increment => self.value := self.value + 1
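
Assuming the PersistentWorker definition above, a doctest along these lines (the test-case name is illustrative) confirms that both the class-side policy and specs built from it pick up #permanent:

TestCase subclass: Ch17PersistentPolicy

  testOverriddenPolicy =>
    // the class supervisionPolicy override is visible on the class side
    self assert: PersistentWorker supervisionPolicy equals: #permanent

  testSpecInheritsOverriddenPolicy =>
    // supervisionSpec picks up the overridden policy, just as with Counter
    spec := PersistentWorker supervisionSpec
    self assert: spec restart equals: #permanent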

Restart strategies

Override class strategy on your supervisor to change how a child crash affects its siblings:

Strategy      Meaning
#oneForOne    Only restart the crashed child (default)
#oneForAll    Restart all children when one crashes
#restForOne   Restart the crashed child and all children started after it

TestCase subclass: Ch17Strategies

  testDefaultStrategyIsOneForOne =>
    self assert: CounterApp strategy equals: #oneForOne
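
A non-default strategy is declared the same way as children. A sketch, using hypothetical Producer and Consumer actor classes that depend on each other:

Supervisor subclass: PipelineApp
  // restart both children whenever either crashes
  class strategy => #oneForAll
  class children => #[
    Producer supervisionSpec,
    Consumer supervisionSpec
  ]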

Dynamic supervisors

A DynamicSupervisor starts with no children. You add children at runtime as demand requires. This is ideal for per-request or per-connection workers.

Subclass DynamicSupervisor and override class childClass:

TestCase subclass: Ch17DynamicSupervisor

  testDynamicSupervisorIsSupervisorFlag =>
    // DynamicSupervisor subclasses are also supervisors
    self assert: WorkerPool isSupervisor

  testDynamicSupervisorHasNoStaticChildren =>
    // Dynamic supervisors define class childClass instead of a static children list
    self assert: WorkerPool childClass equals: Counter

A WorkerPool is defined like this:

DynamicSupervisor(Counter) subclass: WorkerPool
  class childClass => Counter

At runtime, call startChild / startChild: on a running supervisor instance to add workers:

// pool := WorkerPool supervise   // start the supervisor
// pool startChild                // add a Counter child
// pool count                     // => 1
// pool startChild                // add another
// pool count                     // => 2

Supervisor lifecycle

TestCase subclass: Ch17Lifecycle

  testSupervisorClassMethods =>
    // supervise — starts the supervisor as an OTP process
    // current  — retrieves a running supervisor
    // isSupervisor — true for all Supervisor subclasses
    self assert: CounterApp isSupervisor
    self assert: CounterApp strategy equals: #oneForOne
    self assert: (CounterApp children size) equals: 1

Key class-side API:

Message              Returns          Description
MyApp supervise      supervisor pid   Start the supervision tree
MyApp current        supervisor pid   Get the running instance
MyApp isSupervisor   Boolean          Always true for supervisors
MyApp strategy       Symbol           Restart strategy
MyApp children       Array            Static child specs

Key instance-side API:

Message                  Description
sup children             List running children
sup which: Counter       Find a specific child
sup terminate: Counter   Stop a specific child
sup count                Count running children
sup stop                 Shut down the supervisor and all children
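
As with the dynamic pool above, instance-side calls need a running supervisor, so they are shown as comments rather than doctests:

// sup := CounterApp supervise   // start the tree
// sup children                  // list the running children
// sup which: Counter            // find the Counter child
// sup terminate: Counter        // stop that child
// sup stop                      // shut down the supervisor and all children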

Graceful shutdown timeout

By default, workers get 5000ms to shut down and nested supervisors get unlimited time. Use withShutdown: to override the timeout (in milliseconds) for children that need time to drain connections or flush state:

HttpServer supervisionSpec withShutdown: 30000   // 30s graceful shutdown
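
If the spec exposes a shutdown accessor (an assumption here, by analogy with spec restart and spec id), the builder can be checked in a doctest:

TestCase subclass: Ch17Shutdown

  testShutdownTimeout =>
    // withShutdown: stores the timeout on the spec
    // (the shutdown accessor is assumed, not shown elsewhere in this chapter)
    spec := Counter supervisionSpec withShutdown: 30000
    self assert: spec shutdown equals: 30000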

Summary

Static supervision tree:

Supervisor subclass: MyApp
  class strategy => #oneForOne          // optional, default
  class children => #[
    SomeActor supervisionSpec,
    OtherActor supervisionSpec withRestart: #permanent
  ]

Dynamic supervision tree:

DynamicSupervisor(Counter) subclass: WorkerPool
  class childClass => Counter

// pool := WorkerPool supervise
// pool startChild        // spawn a new Counter
// pool startChild: args  // spawn with arguments
// pool count             // how many children

Restart policies:

#temporary    never restarted (default)
#transient    restarted on abnormal exit only
#permanent    always restarted

Strategies:

#oneForOne    restart only the crashed child (default)
#oneForAll    restart all children when any crashes
#restForOne   restart crashed child + all after it

Exercises

1. Default restart policy. What restart policy does a new actor class have by default? How do you check it?

Hint
Counter supervisionPolicy    // => #temporary

All actors default to #temporary — they are never automatically restarted. Override class supervisionPolicy => #permanent to change this.

2. Custom supervision spec. Create a supervision spec for Counter with a #permanent restart policy and a custom ID of #mainCounter. Chain the builder methods.

Hint
spec := Counter supervisionSpec
  withId: #mainCounter
  withRestart: #permanent
spec id         // => #mainCounter
spec restart    // => #permanent

3. Strategy choice. When would you choose #oneForAll over #oneForOne? Give a concrete example.

Hint

Use #oneForAll when children depend on each other and can't function independently. Example: a database connection pool and a cache actor — if the pool crashes, the cache holds stale connections and must also restart. #oneForOne (default) is best when children are independent, like multiple worker processes handling separate requests.

Next: Chapter 18 — File I/O