Understanding the "Let It Crash" philosophy

You may have heard of a philosophy associated with Erlang called "Let It Crash". Erlang applications can quickly recover from errors by restarting parts of their system. Elixir also embraces this philosophy.

Unfortunately, this approach is frequently misunderstood. Some people assume that "let it crash" means Erlang and Elixir developers don’t do any error handling, which is not the case. Pattern matching and the with special form in Elixir make working with {:ok, result} and {:error, msg} tuples easy, and this approach is widely used in the community. Elixir also has try and rescue for catching exceptions, similar to try and catch in other languages.
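As a minimal sketch of both styles, consider the module below. The Checkout module, its functions, and the prices are hypothetical names invented for illustration; only with, Integer.parse, div and try/rescue come from Elixir itself.

```elixir
defmodule Checkout do
  # Hypothetical helper: returns {:ok, quantity} or {:error, msg}.
  def parse_quantity(input) do
    case Integer.parse(input) do
      {n, ""} when n > 0 -> {:ok, n}
      _ -> {:error, "invalid quantity: #{inspect(input)}"}
    end
  end

  # Hypothetical price list, in cents.
  def unit_price(:apple), do: {:ok, 30}
  def unit_price(:pear), do: {:ok, 45}
  def unit_price(other), do: {:error, "unknown item: #{inspect(other)}"}

  # `with` walks the happy path. The first result that does not match
  # {:ok, _} is returned unchanged, so the {:error, msg} tuple reaches the caller.
  def total(item, quantity_input) do
    with {:ok, quantity} <- parse_quantity(quantity_input),
         {:ok, price} <- unit_price(item) do
      {:ok, quantity * price}
    end
  end

  # try/rescue converts a raised exception into an {:error, msg} tuple.
  def safe_divide(a, b) do
    try do
      {:ok, div(a, b)}
    rescue
      e in ArithmeticError -> {:error, Exception.message(e)}
    end
  end
end
```

Calling Checkout.total(:apple, "3") returns {:ok, 90}, while Checkout.total(:kiwi, "3") and Checkout.safe_divide(1, 0) both return an {:error, msg} tuple instead of crashing the caller.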

However, as hard as we try as engineers, we know that errors will happen. This often leads to something called defensive programming: the practice of relentlessly trying to cover every single possible failure scenario, even when some scenarios are very unlikely to happen and not worth dealing with.

Erlang and Elixir take a different approach from defensive programming. Since all code runs in processes, and processes are lightweight, they focus on how the system can recover from crashes rather than on how to prevent every crash. You can choose to let a part of the application (or even the whole application) crash and restart, while handling other errors yourself. This shift in thinking and software design is the reason why Erlang became famous for its reliability and scalability.

That’s where Elixir Supervisors come into play. They can isolate crashes and also restart child processes. There are three different restart values available to us:

:temporary never restarts child processes
:transient restarts child processes, but only when they exit abnormally, that is, with a reason other than :normal, :shutdown or {:shutdown, term}
:permanent always restarts child processes, keeping them running even when they shut down without an error
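The three restart values can be compared with a small experiment. This is only a sketch: RestartDemo and the :demo_worker name are hypothetical, an Agent stands in for a real worker, and the short sleep is a crude way to give the supervisor time to react.

```elixir
defmodule RestartDemo do
  # Hypothetical registered name, used only for this illustration.
  @name :demo_worker

  # Starts a supervisor with one Agent child using the given restart value,
  # terminates the child, and reports whether a replacement was started.
  def restarted?(restart, exit_kind) do
    child = %{
      id: @name,
      start: {Agent, :start_link, [fn -> 0 end, [name: @name]]},
      restart: restart
    }

    {:ok, sup} = Supervisor.start_link([child], strategy: :one_for_one)
    old_pid = Process.whereis(@name)

    case exit_kind do
      # Clean exit with reason :normal
      :normal -> Agent.stop(@name)
      # Abnormal exit
      :crash -> Process.exit(old_pid, :kill)
    end

    # Crude wait so the supervisor can handle the exit
    Process.sleep(100)
    new_pid = Process.whereis(@name)
    Supervisor.stop(sup)

    is_pid(new_pid) and new_pid != old_pid
  end
end

# RestartDemo.restarted?(:temporary, :crash)   # false
# RestartDemo.restarted?(:transient, :normal)  # false
# RestartDemo.restarted?(:transient, :crash)   # true
# RestartDemo.restarted?(:permanent, :normal)  # true
```

In a real application the restart value would normally be set once in the child specification of each worker, rather than passed around like this.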

The word restart is slightly misleading in the context of Elixir processes. Once a process exits, it cannot be brought back to life. Therefore, restarting a process results in starting a new process to take the place of the old one, using the same child specification.
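To see that a restart really means a brand new process, here is a rough sketch that kills a supervised Agent and inspects its replacement. FreshStart and the :counter name are made up for this example, and the sleep is again just a simple way to wait for the supervisor.

```elixir
defmodule FreshStart do
  def run do
    children = [
      # No :restart key, so the default value :permanent applies
      %{id: :counter, start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]}}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

    old_pid = Process.whereis(:counter)
    Agent.update(:counter, &(&1 + 5))
    state_before = Agent.get(:counter, & &1)

    # Simulate a crash, then give the supervisor a moment to react
    Process.exit(old_pid, :kill)
    Process.sleep(100)

    new_pid = Process.whereis(:counter)
    state_after = Agent.get(:counter, & &1)
    Supervisor.stop(sup)

    %{
      same_pid?: old_pid == new_pid,
      old_alive?: Process.alive?(old_pid),
      state_before: state_before,
      state_after: state_after
    }
  end
end

# FreshStart.run()
# %{same_pid?: false, old_alive?: false, state_before: 5, state_after: 0}
```

The new process has a different pid and starts again from the initial state defined in the child specification. Anything the old process kept in memory is gone.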

Depending on how frequently a child crashes, the supervisor itself can also terminate when a child process cannot be recovered. Remember that a supervisor’s responsibility is to watch over its processes. If a :transient or :permanent restart value is used and a process keeps crashing, the supervisor will exit once its restart limit is exceeded (by default, more than 3 restarts within 5 seconds), because it has failed to keep that process running.
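The following sketch shows a supervisor giving up on a child that keeps crashing. GiveUpDemo is a hypothetical module, and the limit of 2 restarts is chosen only to keep the example short. The demo runs inside its own task that traps exits, because a process linked to the supervisor would otherwise be taken down with it. Expect some error reports in the log while it runs.

```elixir
defmodule GiveUpDemo do
  def run do
    task =
      Task.async(fn ->
        # Trap exits so the linked supervisor's death arrives as a message
        Process.flag(:trap_exit, true)

        # Tasks are :temporary by default, so override the restart value
        crashing_child =
          Supervisor.child_spec(
            {Task,
             fn ->
               Process.sleep(10)
               exit(:boom)
             end},
            restart: :permanent
          )

        {:ok, sup} =
          Supervisor.start_link([crashing_child],
            strategy: :one_for_one,
            max_restarts: 2,
            max_seconds: 5
          )

        receive do
          {:EXIT, ^sup, reason} -> {:supervisor_exited, reason}
        after
          2_000 -> :still_running
        end
      end)

    Task.await(task)
  end
end

# GiveUpDemo.run()
# {:supervisor_exited, :shutdown}
```

When that supervisor is itself a child of another supervisor, its exit is handled one level up, which is how a failure can escalate through a supervision tree.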

To conclude, let it crash is more than simple redundancy. It’s about building self-recoverability into the application. It’s about putting your site reliability efforts into your architecture rather than low-level defensive coding. It’s about decoupling your application and introducing asynchronicity in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software.

If you want to know more about how to implement fault-tolerant systems in Erlang, I recommend reading the book Designing for Scalability with Erlang/OTP. It will give you a better understanding of how things work and how powerful the BEAM really is.