Hiding Theory in Practice

2022/11/23

I'm a self-labeled incident nerd. I very much enjoy reading books and papers about them, I hang out with other incident nerds, and I always look for ways to connect the theory I learn about with the events I see at work and in everyday life. As it happens, studying incidents tends to put you in close proximity with many systems that are in various states of failure, which also tends to elicit all sorts of negative reactions from the people around them.

This sensitive nature makes it perhaps unsurprising that incident investigation and review facilitation come with a large number of concepts and practices you are told to avoid because they are considered counterproductive. A tricky question I want to discuss in this post is how to deal with them when you see them come up.

A small sample of these undesirable concepts includes things such as:

  • Root Cause: I've covered this one in Errors are constructed, not discovered. To put it briefly, focusing on root causes tends to narrow the investigation in a way that ignores a rich tapestry of contributing factors.
  • Normative Judgments: this is often used when saying someone should have done something that they have not. It carries the risk of siding with the existing procedure as correct and applicable by default, and tends to blame and demand change from operators more than their tools and support structure.
  • Counterfactuals: those are about things that did not happen: "had we been warned earlier, none of this would have cascaded." This is a bit like preparing for yesterday's battle. It's very often coupled with normative judgments ("the operator failed to do X, which led to ...")
  • Human Error: generally not a useful concept, at least not in the way you'd think. This is best covered in "Those found responsible have been sacked" by Richard Cook or The Field Guide to Understanding 'Human Error', but tends to be the sign of an organization protecting itself, or of a failed investigation. Generally the advice is that if you find human error, that's where the investigation begins, not where it ends.
  • Blame: psychological safety is generally hard to maintain if people feel that they are going to be punished for doing their best and trying to help. You can only get good information if people trust that they can reveal it. Blameless processes, or rather blame-aware reviews, aim to foster this safety.

There are more concepts than these, and each could be a post on its own. I've chosen this list because each of them is an extremely common reaction, something so intuitive it will feel self-evident to the people using it. Avoiding them requires a kind of unlearning: removing the usual framing you'd use to interpret events, then gradually learning to reconstruct them differently.

This is challenging, and while this is something you and other self-labeled incident nerds can extensively discuss and debate as peers, it is not something you can reasonably expect others to go through in a natural post-incident setting. Most of the people with whom you will interact will never care about the theory as much as you do, almost by definition since you're likely to represent expertise for the whole organization on these topics.

In short, you need to find a way to act that is coherent with the theory you hold as an inspiration, while being flexible enough not to cause friction with others, nor to require them to know everything you know for your own work to be effective.

As an investigator or facilitator, let's imagine someone who's a technical expert on the team comes to you during the investigation (before the review) and says "I don't get why the on-call engineer couldn't find the root cause right away since it was so obvious. All they had to do was follow the runbook and everything would have been fine!"

There are going to be times where it's okay to let go of these comments, to avoid doing a deep dive on every opportunity. In the context of a review based on a thematic analysis, the themes you are focusing on should help direct where you put your energy, and guide you to figure out whether emotionally-charged comments are relevant or not.

But let's assume they are relevant to your themes, or that you're still trying to figure them out. Here are two reactions you can have, which may come up as easy solutions but are not very constructive:

  • You may want to police their intervention: since you care for blame-awareness and psychological safety, you may want to nip this behavior in the bud and let them know about the issues around blame, normativeness and counterfactuals.
  • You may also want to ignore that statement, drop it from your notes, and make sure it does not come up in any written form. Just pretend it never came up.

In either case, if behavior that clashes with theoretical ideals is not welcomed, the end result is that you lose precious data, either by omission or by making participants feel less comfortable in talking to you.

Strong emotional reactions are as good data as any architecture diagram for your work. They can highlight important and significant dynamics about your organization. Ignoring them is ignoring potentially useful data, and may damage the trust people put in you.

The approach I find more useful is one where the theoretical points you know and appreciate guide your actions. That statement is full of amazing hooks to grab onto:

  • That they believe it is obvious but was not to the on-call engineer hints at a clash in their mental models, which is a great opportunity to compare and contrast them. Diverging perspectives like that are worth digging into because they can reveal a lot.
  • The thought that the runbook is complete and adequate is worth exploring: was the on-call engineer aware of it? Are runbooks considered trustworthy by all? Were they entertaining hypotheses or observing signals that pointed another direction? Is there any missing context?
  • That counterfactual point ("everything would have been fine!") is a good call-out for perspective. Does it mean next time we need to change nothing? Can we look into challenges around the current situation to help shape decision-making in the future?
  • Is this frustrated reaction pointing at patterns the engineer finds annoying? Does it hint at conflicts or lack of trust across teams?
  • Zooming out from the "root cause" with a newcomer's eyes can be a great way to get insights into a broader context: is this failure mechanism always easily identifiable? Are there false positives to care for? Has it changed recently? What's the broader context around this component? You can discuss "contributing factors" even when using the words "root cause" with people.

None of these requires interrupting or policing what the interviewee is telling you. The incident investigation itself becomes a place where various viewpoints are shared. The review should then be a place where everyone can broaden their understanding, and can form their own insights about how the socio-technical system works. Welcome the data, use it as a foothold for more discoveries.

If you do bring that testimony to the review (on top of having used it to inform the investigation), make sure you frame it in a way that feels safe and unsurprising for all participants involved. Respect the trust they've put in you.

How to do this, it turns out, is not something about which I have seen a lot of easily applicable theory. It's just hard. If I had to guess, I'd say there's a huge part of it that is tacit knowledge, which means you probably shouldn't wait on theory to learn how to do it. It's way too contextual and specific to your situation. If this is indeed the case, theory can be a fuzzy guideline for you at most, not a clear instruction set.

This is how I believe theory is most applicable: as a hidden guide you use to choose which paths to take, which actions to prefer. There's a huge gap between the idealized higher level models and the mess (or richness) of the real world situations you'll be in. Navigating that gap is a skill you'll develop over time. Theory does not need to be complete to provide practical insights for problem resolution. It is more useful as a personal north star than as a map. Others don't need to see it, and you can succeed without it.

Thanks to Clint Byrum for reviewing this text.


Miscellany or Covert Procrastination?

I was thinking about writing a new blog post, but then I suddenly remembered that I would have to create a Markdown file from scratch for that, and that I had left that feature in Lambdapad half finished; when I opened the editor, I remembered I hadn't updated the Elixir version... has that ever happened to you?


The Demanding Work of Analyzing Incidents

2022/11/01

A few weeks ago, a coworker of mine was running an incident analysis in Jeli, and pointed out that the overall process was a big drag on their energy level, that it was hard to do, even if the final result was useful. They were wondering if this was a sign of learning what is significant or not as part of the analysis in order to construct a narrative.

The process we go through is a simplified version of the Howie guide, trading off thoroughness for time—same as using frozen veggies for dinner on a weeknight instead of locally farmed organic produce even if they'd be nicer. In this post, I want to specifically address that feeling of tiredness. I had written my coworker a large response which is now the backbone of this text, but also shared the ideas with folks of the LFI Community, whose points I have added here.

First of all I agree with my coworker that it's tedious. I also think that their guess is a good one—learning what is useful or not takes a bit of time, and it's really hard to do an in-depth analysis of everything because you're looking for unexpected stuff.

You do tend to get a feel for it over time, but the other thing I’d mention is that the technique used in incident analysis—reading, labeling, and tagging data many times over—is something called Qualitative Coding Analysis. In actual papers and theses, you’d also calibrate your coding via inter-rater reliability measures. Essentially, researchers doing qualitative analysis look at all the data, wait for patterns to emerge, label them, and then ask other scientists to look at the labels and apply them to the source material. If the hit rate is high, confidence in the labels is higher, given different people interpret events and themes in the same way.

This process helps ensure the thematic analysis is solid and keeps bias in check, meeting the standards of scientific peer review. Academics tend to pick their methodology, method, interviewing, and tagging mechanisms very carefully because you have to be able to defend your whole research. When we tag our incidents through a tool like Jeli, we do an informal version of this. Our version is less rigorous (and therefore risks more bias and less thoroughness) but can still surface interesting insights in a somewhat short amount of time, just not in a way that would survive peer review.

Still, that [superficial] analysis is demanding. It's part of something called the Hermeneutic circle, which Ryan Kitchens described as looping on the same information continually with compounding 'lenses'. This is cognitively taxing, but useful to gain insights that wouldn't have been visible from your own initial perspective.

Ryan also pointed out that incident analysts should recognize that, when doing the analysis, they are taking on an additional, distinct burden that no one else in the incident has, and that this affects the toll an incident takes on your energy.

Eric Dobbs, for his part, states:

So many times I feel myself get lost in one forest looking for specific trees, then distracted by all the fascinating flora and fauna—then something snaps me out of it and I can’t remember what tree I was originally looking for. Finding my way back… It’s so exhausting.

All these efforts are done to surface themes. Themes are what lets you extract an interesting narrative out of all the noise of things that happen in an incident. I like to compare it to writing someone's biography. Lots of things happen in someone's life, and if you want to make a book about it worth reading, you're going to have to pick some elements to focus on, and events to ignore or describe in less detail. That's an editorial decision that can remain truthful or faithful to the experiences you want to convey, while choosing to shine a light on more significant elements.

This whole analysis serves the objective of learning from incidents. But learning isn't something you control or dictate. People will draw the lessons they'll draw, regardless of what you had planned for. All you can hope for is to provide the best environment possible for it to take place. In environments like tech, a lot hinges on people's mental models. We can neither implant nor extract mental models, so challenging them through experience or discussion is the next best thing. How people were making decisions, the various factors and priorities they were juggling, and the challenges they were encountering are all key parts of their experience you want to unveil.

In short:

  • It's normal to find it tiring, this is a bit like doing science (but with a lighter process)
  • It does get easier as you get a better feel of the sort of interesting stuff worth surfacing and discussing
  • Keep in mind that we don't control the learning and most of it gets done in active discussions and comparisons of people's understandings (mental models)
  • Your task is then more easily defined as finding good jumping-off points and constructing a narrative that lets your coworkers tell their story with high psychological safety (questions, confusion, people working in opposite directions are all good markers of models not being aligned)
  • The story-telling and discussion will take care of doing the teaching; if you want to write a review, look at what people were telling you in one-on-one discussions when preparing questions, what people were saying in the meeting, or the sort of questions people were asking.
  • If you do have enough interesting facets to your incident to run a discussion for an hour (the longest meeting duration we tend to use), you may start letting go and skimming the finer points. The fact that we run these as meetings puts an upper bound on complexity and the number of themes covered. If you do a written report only, it's tempting to just keep going as deep as possible, but long texts make for shallow readings.

A final note on the editorial stance of a written review that follows up your investigation: focus on themes you think were interesting, and be descriptive more than prescriptive. It may make sense to note insights or patterns people highlighted or felt were noticeable, but don't pretend to have the answers or the essence of what people should remember. I feel I'm doing a better job of writing a report when I consider the task to be an extension of incident review facilitation. Set the proper tone and present information so people can draw whatever lessons they can, but from what is hopefully a richer set of perspectives with varied points of view.

You're not there to tell them what was important or worth thinking about, but to give the best context for them to figure it out.

Thanks to Chad Todd for reviewing this text.


My Future with Elixir: set-theoretic types

This is a three-article series on My Future with Elixir, containing excerpts from my keynotes at ElixirConf Europe 2022 and ElixirConf US 2022.

In May 2022, we celebrated 10 years since Elixir v0.5, the first public release of Elixir, was announced.

At such occasions, it may be tempting to try to predict how Elixir will look 10 years from now. However, I believe that would be a futile effort because, 10 years ago, I would never have guessed that Elixir would go beyond excelling at web development into domains such as embedded software, while also making inroads into machine learning and data analysis with projects such as Nx (Numerical Elixir), Explorer, Axon and Livebook. Elixir was designed to be extensible, and how it will be extended has always been a community effort.

For these reasons, I choose to focus on My Future with Elixir. Those are the projects I am personally excited about and working on alongside other community members. The topic of today’s article is type systems, as discussed in my ElixirConf EU presentation in May 2022.

The elephant in the room: types

Throughout the years, the Elixir Core Team has addressed the biggest needs of the community. Elixir v1.6 introduced the Elixir code formatter, as the growing community and large teams saw an increased need for style guides and conventions around large codebases.

Elixir v1.9 shipped with built-in support for releases: self-contained archives that consist of your application code, all of its dependencies, plus the whole Erlang Virtual Machine (VM) and runtime. The goal was to address the perceived difficulty in deploying Elixir projects, by bringing tried approaches from both Elixir and Erlang communities into the official tooling. This paved the way to future automation, such as mix phx.gen.release, which automatically generates a Dockerfile tailored to your Phoenix applications.

Given our relationship with the community, it would be disingenuous to talk about my future with Elixir without addressing what seems to be the biggest community need nowadays: static typing. However, when the community asks for static typing, what are we effectively expecting? And what is the Elixir community to gain from it?

Types and Elixir

Different programming languages and platforms extract different values from types. These values may or may not apply to Elixir.

For example, different languages can extract performance benefits from types. However, Elixir still runs on the Erlang VM, which is dynamically typed, so we should not expect any meaningful performance gain from typing Elixir code.

Another benefit of types is to aid documentation (emphasis on the word aid as I don’t believe types replace textual documentation). Elixir already reaps similar benefits from typespecs and I would expect an integrated type system to be even more valuable in this area.
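
As a small reminder of what typespecs look like today (the module and function below are invented for illustration), the @spec documents intent and feeds tools such as Dialyzer and ExDoc, but it is not enforced by the compiler:

defmodule MyApp.Temperature do
  @spec to_fahrenheit(number()) :: float()
  def to_fahrenheit(celsius), do: celsius * 1.8 + 32
end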

However, the upsides and downsides of static typing become fuzzier and prone to exaggerations once we discuss them in the context of code maintenance, in particular when comparing types with other software verification techniques, such as tests. In those situations, it is common to hear unrealistic claims such as “a static type system would catch 80% of my Elixir bugs” or that “you need to write fewer tests once you have static types”.

While I explore during the keynote why I don’t believe those claims are true, saying that a static type system helps catch bugs is not helpful unless we discuss exactly which kinds of bugs it is supposed to identify, and that’s what we should focus on.

For example, Rust’s type system helps prevent bugs such as deallocating memory twice, dangling pointers, data races in threads, and more. But adding such a type system to Elixir would be unproductive because those are not bugs that we run into in the first place, as those properties are guaranteed by the garbage collector and the Erlang runtime.

This brings another discussion point: a type system naturally restricts the amount of code we can write because, in order to prove certain properties about our code, certain styles have to be rejected. However, I would prefer to avoid restricting the expressive power of Elixir, because I am honestly quite happy with the language semantics (which we mostly inherited from Erlang).

For Elixir, the benefit of a type system would revolve mostly around contracts. If a function caller(arg) calls a function named callee(arg), we want to guarantee that, as both of these functions change over time, caller keeps passing valid arguments into callee and properly handles the return types from callee.

This may seem like a simple guarantee to provide, but we’d run into tricky scenarios even on small code samples. For example, imagine that we define a negate function that negates numbers. One may implement it like this:

def negate(x) when is_integer(x), do: -x

We could then say negate has the type integer() -> integer().

With our custom negation in hand, we can implement a custom subtraction:

def subtract(a, b) when is_integer(a) and is_integer(b) do
  a + negate(b)
end

This would all work and typecheck as expected, as we are only working with integers. However, imagine in the future someone decides to make negate polymorphic, so it also negates booleans:

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

If we were to naively say that negate now has the type integer() | boolean() -> integer() | boolean(), we would now get a false positive warning in our implementation of subtract:

Type warning:

  |
  |  def subtract(a, b) when is_integer(a) and is_integer(b) do
  |    a + negate(b)
         ^ the operator + expects integer(), integer() as arguments,
           but the second argument can be integer() | boolean()

So we want a type system that can type contracts between functions but, at the same time, avoids false positives and does not restrict the Elixir language. Balancing those trade-offs is not only a technical challenge but also one that needs to consider the needs of the community. The Dialyzer project, implemented in Erlang and available for Elixir projects, chose to have no false positives. However, that implies certain bugs may not be caught.

At this point in time, it seems the overall community would prefer a system that flags more potential bugs, even if it means more false positives. This may be particularly tricky in the context of Elixir and Erlang because I like to describe them as assertive languages: we write code that will crash in face of unexpected scenarios because we rely on supervisors to restart parts of our application whenever that happens. This is the foundation of building self-healing and fault-tolerant systems in those languages.

On the other hand, this is what makes a type system for Erlang/Elixir so exciting and unique: the ability to deal with failure modes both at compile-time and runtime elegantly. Because at the end of the day, regardless of the type system of your choice, you will run into unexpected scenarios, especially when interacting with external resources such as the filesystem, APIs, distributed nodes, etc.

The big announcement

This brings me to the big announcement from ElixirConf EU 2022: we have an ongoing PhD scholarship to research and develop a type system for Elixir based on set-theoretic types. Guillaume Duboc (PhD student) is the recipient of the scholarship, led by Giuseppe Castagna (Senior Researcher) with support from José Valim (that’s me).

The scholarship is a partnership between the CNRS and Remote. It is sponsored by Supabase (they are hiring!), Fresha (they are hiring!), and Dashbit, all heavily invested in Elixir’s future.

Why set-theoretic types?

We want a type system that can elegantly model all of Elixir idioms and, at a first glance, set-theoretic types were an excellent match. In set-theoretic types, we use set operations to define types and ensure that the types satisfy the associativity and distributivity properties of the corresponding set-theoretic operations.

For example, numbers in Elixir can be integers or floats, therefore we can write them as the union integer() | float() (which is equivalent to float() | integer()).

Remember the negate function we wrote above?

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

We could think of it as a function that has both types (integer() -> integer()) and (boolean() -> boolean()), which is an intersection. This would naturally solve the problem described in the previous section: when called with an integer, it can only return an integer.

We also have a data structure called atoms in Elixir. They uniquely represent a value which is given by their own name, such as :sunday or :banana. You can think of the type atom() as the set of all atoms. In addition, we can think of the values :sunday and :banana as subtypes of atom(), as they are contained in the set of all atoms. :sunday and :banana are also known as singleton types (as they are made up of only one value).

In fact, we could even consider each integer to be a singleton type that belongs to the integer() set. The choice of which values will become singletons in our type system will strongly depend on the trade-offs we defined in the previous sections.
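
As a hypothetical illustration (this example is not from the article), the clauses below could be described with singleton and union types: the first clause as (:saturday | :sunday) -> :weekend and the fallback as atom() -> :weekday.

def day_kind(day) when day in [:saturday, :sunday], do: :weekend
def day_kind(day) when is_atom(day), do: :weekday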

Furthermore, the type system has to be gradual, as any typed Elixir code would have to interact with untyped Elixir code.

Personally, I find set-theoretic types an elegant and accessible approach to reasoning about types. At the end of the day, an Elixir developer won’t have to think about intersections when writing a function with multiple clauses, but the modelling is straightforward if they ever look under the hood.

Despite the initial fit between Elixir semantics and set-theoretic types, there are open questions and existing challenges in putting the two together. Here are some examples:

  • Elixir has an expressive collection of idioms used in pattern matching and guards; can we map them all to set-theoretic types?

  • Elixir associative data structures, called maps, can be used both as records and as dictionaries. Would it be possible to also type them with a unified foundation?

  • Gradual type systems must introduce runtime type checks in order to remain sound. However, those type checks will happen in addition to the checks already done by the Erlang VM, which can degrade performance. Therefore, is it possible to leverage the existing runtime checks done by the Erlang VM so the resulting type system is still sound?

Those challenges are precisely what makes me excited to work with Giuseppe Castagna and Guillaume Duboc, as we believe it is important to formalize those problems and their solutions, before we dig deep into the implementation. To get started with set-theoretic types, I recommend Programming with union, intersection, and negation types by Giuseppe Castagna.

Finally, it is important to note there are areas we don’t plan to tackle at the moment, such as typing of messages between processes.

Expectations and roadmap

At this point, you may be expecting that Elixir will certainly become a gradually typed language at some moment in its future. However, it is important to note this may not be the case, as there is a long road ahead of us.

One of the challenges in implementing a type system - at least for someone who doesn’t have the relevant academic background like myself - is that it feels like a single indivisible step: you take a language without a type system and at the end you have one, without much insight or opportunity for feedback in the middle. Therefore, we have been planning to incorporate the type system into Elixir in steps, which I have been referring to as “a gradual gradual type system”: one where we add gradual types to the language gradually.

The first step, the one we are currently working on, is to leverage the existing type information found in Elixir programs. As previously mentioned, we write assertive code in Elixir, which means there is a lot of type information in patterns and guards. We want to lift this information and use it to type check existing codebases. The Erlang compiler already does so to improve performance within a single module and we want to eventually do so across modules and applications too.
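
As a hypothetical sketch of this idea (the module, function, and checker behaviour are invented for illustration, not taken from the actual implementation), patterns and guards already carry type information that such a checker could lift and verify across modules:

defmodule Exams do
  # The guards assert that grade/1 only accepts integers
  def grade(score) when is_integer(score) and score >= 60, do: :pass
  def grade(score) when is_integer(score), do: :fail
end

# Exams.grade("85") raises FunctionClauseError at runtime today; a checker that
# lifts the integer guard could report the mismatch at compile time instead.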

During this phase, Elixir developers won’t have to change a single line of code to leverage the benefits of the type system. Of course, we will catch only part of the existing bugs, but this will allow us to stress test, benchmark, and collect feedback from developers, making improvements behind the scenes (or even revert the whole thing if we believe it won’t lead us where we expect).

The next step is to introduce typed structs into the language, allowing struct types to propagate throughout the system as you pattern match on them across the codebase. In this stage we will introduce a new API for defining structs, yet to be discussed, and developers will have to use the new API to reap its benefits.

Then finally, once we are happy with the improvements and the feedback collected, we can migrate to introduce a new syntax for typing function signatures in Elixir codebases, including support for more advanced features such as polymorphic types. Those will allow us to type complex constructs such as the ones found in the Enum module.

The important point to keep in mind is that those features will be explored and developed in steps, with plenty of opportunity to gather community feedback. I also hope our experience may be useful to other ecosystems that wish to gradually introduce type systems into existing programming languages, in a way that feels granular and participative.

Thank you for reading and see you in a future article of the “My Future with Elixir” series.


Erlang/OTP 25.1 Release

OTP 25.1

Erlang/OTP 25.1 is the first maintenance patch package for OTP 25, containing mostly bug fixes as well as quite a few small improvements.

Below are some highlights of the release:

crypto:

  • Crypto is now considered to be usable with the OpenSSL 3.0 cryptolib for production code. ENGINE and FIPS are not yet fully functional.

  • Changed the behaviour of the engine load/unload functions

ssl:

  • A vulnerability has been discovered and corrected. It is registered as CVE-2022-37026 “Client Authentication Bypass”. Corrections have been released on the supported tracks with patches 23.3.4.15, 24.3.4.2, and 25.0.2. The vulnerability might also exist in older OTP versions. We recommend that impacted users upgrade to one of these versions or later on the respective tracks; OTP 25.1 would be an even better choice. Impacted are those who are running an ssl/tls/dtls server using the ssl application either directly or indirectly via other applications, for example via inets (httpd), cowboy, etc. Note that the vulnerability only affects servers that request client certificates, that is, servers that set the option {verify, verify_peer} (a configuration sketch follows below).
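
As a minimal sketch of the affected configuration (shown in Elixir syntax calling the Erlang ssl module; the port, file paths, and other options are placeholders, not from the release notes), a server requests client certificates when it listens with verify set to verify_peer:

:ok = :ssl.start()

{:ok, _listen_socket} =
  :ssl.listen(8443,
    certfile: 'server.pem',
    keyfile: 'server.key',
    cacertfile: 'ca.pem',
    verify: :verify_peer
  )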

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here


Elixir v1.14 released

Elixir v1.14 has just been released. 🎉

Like many of the past Elixir releases, this one has a strong focus on developer experience and developer happiness, through new debugging tools and improvements to term inspection. Let’s take a quick tour of the new features.

Note: this announcement contains asciinema snippets. On the original post you may need to enable third-party JavaScript to see them; links to the recordings are included below for each example.

dbg

Kernel.dbg/2 is a new macro that’s somewhat similar to IO.inspect/2, but specifically tailored for debugging.

When called, it prints the value of whatever you pass to it, plus the debugged code itself as well as its location.
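
As a minimal sketch (the file name and exact output format here are approximate, not copied from the recording):

feature = %{name: :dbg, inspired_by: :ruby}
dbg(feature)
# Prints something like:
#   [example.exs:2: (file)]
#   feature #=> %{inspired_by: :ruby, name: :dbg}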

See the example in asciinema: https://asciinema.org/a/510632

dbg/2 can do more. It’s a macro, so it understands Elixir code. You can see that when you pass a series of |> pipes to it. dbg/2 will print the value for every step of the pipeline.
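
For instance, a pipeline along these lines (the values are invented; each step is printed with its intermediate result):

"Elixir is cool!"
|> String.trim_trailing("!")
|> String.split()
|> List.first()
|> dbg()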

See the example in asciinema: https://asciinema.org/a/509506

IEx + dbg

Interactive Elixir (IEx) is Elixir’s shell (also known as REPL). In v1.14, we have improved IEx breakpoints to also allow line-by-line stepping:

See the example in asciinema: https://asciinema.org/a/508048

We have also gone one step further and integrated this new functionality with dbg/2.

dbg/2 supports configurable backends. IEx automatically replaces the default backend with one that halts the code execution with IEx:

See the example in asciinema: https://asciinema.org/a/509507

We call this process “prying”, as you get access to variables and imports, but without the ability to change how the code actually executes.

This also works with pipelines: if you pass a series of |> pipe calls to dbg (or pipe into it at the end, like |> dbg()), you’ll be able to step through every line in the pipeline.

See the example in asciinema: https://asciinema.org/a/509509

You can keep the default behavior by passing the --no-pry option to IEx.

dbg in Livebook

Livebook brings the power of computation notebooks to Elixir.

As another example of the power behind dbg, the Livebook team has implemented a visual representation for dbg as a backend, where it prints each step of the pipeline as a distinct UI element. You can select an element to see its output, or even re-order and disable sections of the pipeline on the fly.

PartitionSupervisor

PartitionSupervisor implements a new supervisor type. It is designed to help when you have a single supervised process that becomes a bottleneck. If that process’ state can be easily partitioned, then you can use PartitionSupervisor to supervise multiple isolated copies of that process running concurrently, each assigned its own partition.

For example, imagine you have an ErrorReporter process that you use to report errors to a monitoring service.

# Application supervisor:
children = [
  # ...,
  ErrorReporter
]

Supervisor.start_link(children, strategy: :one_for_one)

As the concurrency of your application goes up, the ErrorReporter process might receive requests from many other processes and eventually become a bottleneck. In a case like this, it could help to spin up multiple copies of the ErrorReporter process under a PartitionSupervisor.

# Application supervisor
children = [
  {PartitionSupervisor, child_spec: ErrorReporter, name: Reporters}
]

The PartitionSupervisor will spin up a number of processes equal to System.schedulers_online() by default (most often one per core). Now, when routing requests to ErrorReporter processes we can use a :via tuple and route the requests through the partition supervisor.

partitioning_key = self()
ErrorReporter.report({:via, PartitionSupervisor, {Reporters, partitioning_key}}, error)

Using self() as the partitioning key here means that the same process will always report errors to the same ErrorReporter process, ensuring a form of back-pressure. You can use any term as the partitioning key.

A common example

A common and practical example of a good use case for PartitionSupervisor is partitioning something like a DynamicSupervisor. When starting many processes under it, a dynamic supervisor can be a bottleneck, especially if said processes take a long time to initialize. Instead of starting a single DynamicSupervisor, you can start multiple:

children = [
  {PartitionSupervisor, child_spec: DynamicSupervisor, name: MyApp.DynamicSupervisors}
]

Supervisor.start_link(children, strategy: :one_for_one)

Now you start processes on the dynamic supervisor for the right partition. For instance, you can partition by PID, like in the previous example:

DynamicSupervisor.start_child(
  {:via, PartitionSupervisor, {MyApp.DynamicSupervisors, self()}},
  my_child_specification
)

Improved errors on binaries and evaluation

Erlang/OTP 25 improved errors on binary construction and evaluation. These improvements apply to Elixir as well. Before v1.14, errors when constructing binaries would often be hard-to-debug, generic “argument errors”. Erlang/OTP 25 and Elixir v1.14 provide more detail for easier debugging. This work is part of EEP 54.

Before:

int = 1
bin = "foo"
int <> bin
#=> ** (ArgumentError) argument error

Now:

int = 1
bin = "foo"
int <> bin
#=> ** (ArgumentError) construction of binary failed:
#=>    segment 1 of type 'binary':
#=>    expected a binary but got: 1

Code evaluation (in IEx and Livebook) has also been improved to provide better error reports and stacktraces.

Slicing with Steps

Elixir v1.12 introduced stepped ranges, which are ranges where you can specify the “step”:

Enum.to_list(1..10//3)
#=> [1, 4, 7, 10]

Stepped ranges are particularly useful for numerical operations involving vectors and matrices (see Nx, for example). However, the Elixir standard library was not making use of stepped ranges in its APIs. Elixir v1.14 starts to take advantage of steps with support for stepped ranges in a couple of functions. One of them is Enum.slice/2:

letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
Enum.slice(letters, 0..5//2)
#=> ["a", "c", "e"]

binary_slice/2 (and binary_slice/3 for completeness) has been added to the Kernel module; it works on bytes and also supports stepped ranges:

binary_slice("Elixir", 1..5//2)
#=> "lxr"

Expression-based Inspection and Inspect Improvements

In Elixir, it’s conventional to implement the Inspect protocol for opaque structs so that they’re inspected with a special notation, resembling this:

MapSet.new([:apple, :banana])
#MapSet<[:apple, :banana]>

This is generally done when the struct content or part of it is private and the %name{...} representation would reveal fields that are not part of the public API.

The downside of the #name<...> convention is that the inspected output is not valid Elixir code. For example, you cannot copy the inspected output and paste it into an IEx session.

Elixir v1.14 changes the convention for some of the standard-library structs. The Inspect implementation for those structs now returns a string with a valid Elixir expression that recreates the struct when evaluated. In the MapSet example above, this is what we have now:

fruits = MapSet.new([:apple, :banana])
MapSet.put(fruits, :pear)
#=> MapSet.new([:apple, :banana, :pear])

The MapSet.new/1 expression evaluates to exactly the struct that we’re inspecting. This allows us to hide the internals of MapSet, while keeping it as valid Elixir code. This expression-based inspection has been implemented for Version.Requirement, MapSet, and Date.Range.

Finally, we have improved the Inspect protocol for structs so that fields are inspected in the order they are declared in defstruct. The option :optional has also been added when deriving the Inspect protocol, giving developers more control over the struct representation. See the updated documentation for Inspect for a general rundown on the approaches and options available.
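
As a small sketch of the :optional option, assuming it behaves as described above (the module and field names below are invented): fields listed as optional are left out of the inspected output whenever they still hold their default value.

defmodule User do
  @derive {Inspect, optional: [:email]}
  defstruct [:name, email: nil]
end

inspect(%User{name: "Jane"})
# :email is omitted from the output because it still equals its default (nil)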

Learn more

For a complete list of all changes, see the full release notes.

Check the Install section to get Elixir installed and read our Getting Started guide to learn more.

Happy debugging!


Debugging a Slow Starting Elixir Application

I recently had to fix an Elixir service that was slow to start. I was able to pinpoint the issue with only a few commands and I want to share a couple of the things I learned.

Dependencies

In Elixir all dependencies are “applications”. The term “application” means something different than it does outside of Elixir. In Elixir an “application” is a set of modules and behaviors. Some of these applications define their own supervision trees and must be started by Application.start/2 before they can be used. When you start your Elixir service, either via Mix or a generated Elixir release, the dependencies you specified in your mix.exs file will be started before your own code is started. If an application listed as a dependency is slow to start, your application must wait until the dependency is running before it can be started.

While the behavior is simple, it is recursive. Each application has its own set of dependencies that must be running before that application can be started, and some of those dependencies have dependencies of their own that must be running before they can start. This results in a dependency tree structure. To illustrate this with a little ASCII:

- your_app
  - dependency_1
    - hidden_dependency_1
    - hidden_dependency_2
  - dependency_2
    - hidden_dependency_3

For this application, the Erlang VM would likely start these applications in this order:

  1. hidden_dependency_3

  2. dependency_2

  3. hidden_dependency_2

  4. hidden_dependency_1

  5. dependency_1

  6. your_app

The application I had to fix had a lot of dependencies. Profiling each application would be tedious and time-consuming, and I had a hunch there was probably a single dependency that was the problem. Turns out it’s pretty easy to write a little code that times the start up of each application.

Profiling

Start an IEx shell with the --no-start flag so that the application is available but not yet loaded or started:

iex -S mix run --no-start

Then load this code into the shell:

defmodule StartupBenchmark do
  def run(application) do
    complete_deps = deps_list(application) # (1)

    dep_start_times = Enum.map(complete_deps, fn(app) -> # (2)
      case :timer.tc(fn() -> Application.start(app) end) do
        {time, :ok} -> {time, app}
        # Some dependencies like :kernel may have already been started, we can ignore them
        {time, {:error, {:already_started, _}}} -> {time, app}
        # Raise if we get a non-successful return value; wrap the term in a
        # message because raise/1 expects an exception or a string
        {time, error} -> raise("could not start #{app}: #{inspect(error)}")
      end
    end)

    dep_start_times
    |> Enum.sort() # (3)
    |> Enum.reverse()
  end

  defp deps_list(app) do
    # Get all dependencies for the app
    deps = Application.spec(app, :applications)

    # Recursively call to get all sub-dependencies
    complete_deps = Enum.map(deps, fn(dep) -> deps_list(dep) end)

    # Build a complete list of sub dependencies, with the top level application
    # requiring them listed last, also remove any duplicates
    [complete_deps, [app]]
    |> List.flatten()
    |> Enum.uniq()
  end
end

To highlight the important pieces from this module:

  1. Recursively get all applications that must be started in the order they need to be started in.

  2. Start each application in order; timing each one.

  3. Sort applications by start time so the slowest application is the first item in the list.

With this code finding applications that are slow to start is easy:

> StartupBenchmark.run(:your_app)
[
  {6651969, :prometheus_ecto},
  {19621, :plug_cowboy},
  {14336, :postgrex},
  {13598, :ecto_sql},
  {5123, :yaml_elixir},
  {3871, :phoenix_live_dashboard},
  {1159, :phoenix_ecto},
  {123, :prometheus_plugs},
  {64, :ex_json_logger},
  {56, :prometheus_phoenix},
  {56, :ex_ops},
  {36, :kernel},
  ...
]

These times are in microseconds so in this case prometheus_ecto is taking 6.6 seconds to start. All other applications are taking less than 20 milliseconds to start and many of them are taking less than 1 millisecond to start. prometheus_ecto is the culprit here.

Conclusion

With the code above I was able to identify prometheus_ecto as the problem. With this information I was then able to use eFlambe and a few other tools to figure out why prometheus_ecto was so slow and quickly fixed the issue.

I hope the snippet of code above will be helpful to some of you. If you like reading my blog posts please subscribe to my newsletter. I send emails out once a month with my latest posts.


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.