Software Acceleration and Desynchronization

A bit more than a month ago, I posted the following on social media:

Seeing more reports and industry players blaming code reviews for slowing down the quick development done with AI. It's unclear whether anyone's asking if this is just moving the cognitive bottleneck of "understanding what's happening" around. "Add AI to the reviews" seems to be the end goal here.

And I received multiple responses, some going "this is a terrible thing" and some going "yeah, that's actually not a bad idea." Back then I didn't necessarily have the concepts to clarify these thoughts, but I've since found ways to express the issue in a clearer, more system-centric way. While this post is clearly driven by the discourse around AI (particularly LLMs), it is more of a structural argument about the kind of changes their adoption triggers, and about the broader acceleration patterns the industry has seen before with other technologies and processes; as such, I won't really mention AI much from here on.

The model I’m proposing here is inspired by (or is a dangerously misapplied simplification of) the one presented in Hartmut Rosa’s Social Acceleration,1 bent out of shape to fit my own observations. A pattern I’ll start with is one of loops, or cycles.

Loops, Cycles, and Activities

Let’s start with a single linear representation of the work around writing software:

A directed graph containing: plan work → write code → code reviews → deploy

This is a simplification because we could go much deeper, such as in this image of what value stream mapping could look like in the DORA report:2

DORA report 2025 Figure 50: value stream mapping, which contains elements in a sequence such as backlog, analyse codebase, code, generate unit tests, security analysis, code commit, production, etc. The gap between code commit and production is expanded to show sub-steps like create merge, code review, build, deploy to QA, deploy bugfix, etc.

But we could also go for less linear to show a different type of complexity, even with a simplified set of steps:

A directed graph containing: Plan work → scope ticket → write tests → write code → self-review → code review (peers) → merge → ship; each step between plan work and merge also link back to multiple previous steps, representing going back to the drawing board. Specific arrows also short-circuit Plan work → write code → self-review → merge → ship (and self-review → ship). The sequence then circles back from ship to plan work as a new task is selected.

Each of the steps above can imply a skip backwards to an earlier task, and emergencies can represent skips forwards. For the sake of the argument, it doesn't matter whether our model is adequately detailed or just a rough approximation; we could make it more or less accurate (the “write tests” node could easily be expanded to fill a book), since this is mostly for illustrative purposes.

Overall, in all versions, tasks aim to go as quickly as possible from beginning to end, with an acceptable degree of quality. In a mindset of accelerating development, we can therefore take a look at individual nodes (writing code, debugging, or reviewing code) for elements to speed up, or at overall workflows by influencing the cycles themselves.

For example, code reviews can be sped up with auto formatting and linting—automated rule checks that enforce standards or prevent some practices—which would otherwise need to be done by hand. This saves time and lets people focus on higher-level review elements. And the overall cycle can be made faster by moving these automated rules into the development environment, thereby tightening a feedback loop: fix as you write rather than accumulating flaws on top of which you build, only to then spend time undoing things to fix their foundations.

Concurrent Loops

So far, so good. The problem, though, is that this one isolated loop is insufficient to properly represent most software work. Not only are there multiple tasks run in parallel by multiple people, each representing an independent loop, but each person is also part of multiple loops. For example, while you might tackle a ticket for some piece of code, you may also have to write a design document for an upcoming project, provide assistance on a support ticket, attend meetings, focus on developing your career via mentorship sessions, keep up with organizational exercises through publishing and reading status reports, and so on.

Here are a few related but simplified loops, as a visual aid:

Four different loops, all disconnected, are represented here: coding, assisting support, career development, writing a design document. The graphs are fairly complex and of varying length. Coding has 8 nodes in a loop; assisting support has 5 nodes with a branching path, writing a design doc has 10 nodes in a loop, and the career development one has 10 nodes, but it is full of cycles and references (via color and labels) the coding and design document loops as if they were a subtask.

Once again, each person on your team may run multiple of these loops in parallel during their workday, of various types.

But more than that, there might be multiple loops that share sections. You can imagine how writing code in one part of the system can prepare you or improve your contributions to multiple sorts of tasks: writing code that interacts with it, modifying it, reviewing changes, writing or reviewing docs, awareness of possible edge cases for incidents, better estimates of future tasks, and so on.

You can also imagine how, in planning how to best structure code for new product changes, experience with the current structure of the code may matter, along with awareness of the upcoming product ambitions. Likewise, the input of people familiar with operational challenges of the current system can prove useful in prioritizing changes. This sort of shared set of concerns informed ideas like DevOps, propelled by the belief that good feedback and integration (not throwing things over the fence) would help software delivery.

Basically, a bunch of loops can optionally contribute to one set of shared activities, but some activities can also be contributors to multiple loops, and these loops might be on multiple time scales:

A coding loop, a career growth loop, and a high-level architecture loop are all depicted as concurrent. However, the three loops share a step. For the coding loop, this is code review, for the growth loop it is about awareness of teams work and norm enforcement, and for the architecture loop, it represents change awareness.

Here, the activity of reviewing code might be the place where the coding loop gets straightforward reviews as desired, but it is also a place where someone's career growth plans can be exercised in trying to influence or enforce norms, and where someone looking at long-term architectural and growth patterns gets to build awareness of ongoing technical changes.

These shared bits are one of the factors that can act like bottlenecks, or can counter speed improvements. To make an analogy: if you were cycling to work 30 minutes each way every day and sped up your commute by going twice as fast via public or private transit, you’d save 2h30 every week; however, some of that time wouldn’t really be “saved” once you consider that you might still need to exercise to stay as physically fit. You would either need to spend much of the saved time on exercise outside of your commute, or end up incidentally trading commute time for longer-term health factors instead.

Applied to software, we may see this pattern in the idea that “we can now code faster, but code review is the new bottleneck.” The obvious step will be to try to speed up code reviewing to match the increased speed of code writing. To some extent, parts of code reviewing can be optimized: maybe we can detect some types of errors more reliably and rapidly through tooling improvements. Again, like linting or type checking, these ideally get moved to development time rather than reviews.

But code reviewing is not just about finding errors. It is also used to discuss maintainability, operational concerns, to spread knowledge and awareness, to get external perspectives, or to foster broader senses of ownership. These purposes, even if they could be automated or sped up, can all indicate the existence of other loops that people may have to maintain regardless.

Synchronization and Desynchronization

If we decide to optimize parts of the work, we can hope for a decent speedup if we do one of:

  1. Continuously develop a proper understanding of the many purposes a task might have, to make useful, well-integrated changes
  2. Give up some of the shared purposes by decoupling loops

The first option is challenging and tends to require research, iteration, and an eye for ergonomics. Otherwise you’ll quickly run into problems of “working faster yet going the same speed,” where despite adopting new tools and methods, the bottlenecks we face remain mostly unchanged. Doing this right implies making your changes knowing they'll structurally impact work when speeding it up, and being ready to support these disruptions.

The second is easy to do in ways that accidentally slow down or damage other loops—if the other purposes still exist, new activities will need to replace the old ones—which may in turn feed back into the original loop (e.g., code reviews may block present code writing, but also inform future code writing), with both loops being weakened or running on two different tempos once decoupled. This latter effect is something we’ll call “desynchronization.” One risk of being desynchronized is that useful or critical feedback from one loop no longer makes it to another.

To cope with this (but not prevent it entirely), we have a third option for optimization:

  3. Adopt “norms” or practices ahead of time that ensure alignment and reduce the need for synchronization.

This is more or less what “best practices” and platforms attempt to provide: standards that, when followed, reduce the need for communication and sense-making. They tend to provide a stable foundation on which to accelerate multiple activities. They don’t fully prevent desynchronization, they just stave it off.

To illustrate desynchronization, let’s look at varied loops that could feed back into each other:

5 layers of loops are shown: platform updates, on a long cycle; repeated coding loops on short cycles; reactive ops loops that interact with coding loops; sparse configuration and optimization tasks; and a long loop about norms. The coding and ops loops interact at synchronous points for code reviews, but are otherwise asynchronous. Learnings from running the software feed back into platform and norm loops, which eventually inform the coding loops.

These show shared points where loops synchronize, across ops and coding loops, at review time. The learnings from operational work can feed back into platform and norms loops, and the code reviews with ops input are one of the places where these are "enforced".3 If you remove these synchronization points, you can move faster, but loops can also go on independently for a while and will grow further and further apart:

5 layers of loops are shown: platform updates, on a long cycle; repeated coding loops on short cycles; reactive ops loops that interact with coding loops; sparse configuration and optimization tasks; and a long loop about norms. The coding and ops loops no longer hold synchronous points for code reviews and are fully asynchronous. Learnings from running the software feed back into platform and norm loops, but they do not properly inform the coding loop anymore and the operations loop has to repeat tasks to enforce norms.

There's not a huge difference between the two images, but what I chose to display here is that a lack of dev-time ops input (during code review) might lead to duplicated batches of in-flight fixes that need to be carried and applied to code as it rolls out, with extra steps peppered through. As changes are made to the underlying platform or shared components, their socialization may lag behind as opportunities to propagate them are reduced. If development is sped up enough without a matching increase in the ability to demonstrate the code's fitness (without waiting for more time in production), the potential for surprises goes up.

Keep in mind that this is one type of synchronization across one shared task between two high-level loops. Real work has more loops, with more nodes, more connections, and many subtler synchronization points both within and across teams and roles. Real loops might be more robust, but less predictable. A loop with multiple synchronization points can remove some of them and look faster until the few remaining synchronization points either get slower (to catch up) or undone (to go fast).

Not all participants in synchronization points get the same thing out of them either. It’s possible an engineer gets permission (and protection) from one, another gets awareness, some other team reinforces compliance, and a management layer claims accountability from it happening, for example.

It's easy to imagine both ends of a spectrum: on one end, organizations that get bogged down in synchronous steps to avoid all surprises, and on the other, organizations that get tangled into the web of concurrent norms and never-deprecated generations of the same stuff all carried at once because none of the synchronous work happens.

Drift that accumulates across loops will create inconsistencies as mental models lag, force corner cutting to keep up with changes and pressures, and widen gaps between what we think happens and what actually happens.4 It pulls subsystems apart, weakens them, and contributes to incidents—unintended points of rapid resynchronization.

I consider incidents to be points of rapid resynchronization because they're usually where you end up having desynchronized so hard that incident response forces you to suspend your usual structure, quickly reprioritize, upend your roadmap, and (ideally) have lots of people across multiple teams suddenly update their understanding of how things work and break down. That the usual silos can't keep going as usual points to forced repair after too much desynchronization.

As Rosa points out in his book, this acceleration tends to grow faster than what the underlying stable systems can support, and those systems become their own hindrances. Infrastructure and institutions are abandoned or dismantled when the systems they enabled gradually feel stalled or constrained by them and seek alternatives:

[Acceleration] by means of institutional pausing and the guaranteed maintenance of background conditions is a basic principle of the modern history of acceleration and an essential reason for its success as well. [Institutions] were themselves exempted from change and therefore helped create reliable expectations, stable planning, and predictability. [...] Only against the background of such stable horizons of expectation does it become rational to make the long-term plans and investments that were indispensable for numerous modernization processes. The erosion of those institutions and orientations as a result of further, as it were, “unbounded” acceleration [...], might undermine their own presuppositions and the stability of late modern society as a whole and thereby place the (accelerative) project of modernity in greater danger than the antimodern deceleration movement.

The need for less synchronization doesn’t mean that synchronization no longer needs to happen. The treadmill never slows down, and actors in the system must demonstrate resilience to reinvent practices and norms to meet demands. This is particularly obvious when the new pace creates new challenges: what brought us here won’t be enough to keep going, and we’ll need to overhaul a bunch of loops again.

There’s something very interesting about this observation: A slowdown in one place can strategically speed up other parts.

Is This Specific Slow a Good Slow or a Bad Slow?

There’s little doubt in my mind that one can go through a full cycle of the “write code” loop faster than one would go through the “suffering the consequences of your own architecture” loop—generally, that latter cycle depends on multiple development cycles to get adequate feedback. You can ship code every hour, but it can easily take multiple weeks for all the corner cases to shake out.

When operating at the level of system design or software architecture (“We need double-entry bookkeeping that can tolerate regional outages”), we tend to require an understanding of the system’s past, a decent sense of its present with its limitations, and an ability to anticipate future challenges to inform the directions in which to push change. This is a different cycle from everyday changes (“The feature needs a transaction in the ledger”), even if both are connected.

The implication here is that if you’re on a new code base with no history and a future that might not exist (such as short-term prototypes or experiments), you’re likely to be able to have isolated short loops. If you’re working on a large platform with thousands of users, years of accrued patterns and edge cases, and the weight of an organizational culture to fight or align with, you end up relying on the longer loops to inform the shorter ones.

The connections across loops accrue gradually over time, and people who love the short loops get very frustrated at how slow they’re starting to be:

Yet irreversible decisions require significantly more careful planning and information gathering and are therefore unavoidably more time intensive than reversible ones. In fact, other things equal, the following holds: the longer the temporal range of a decision is, the longer the period of time required to make it on the basis of a given substantive standard of rationality. This illustrates the paradox of contemporary temporal development: the temporal range of our decisions seems to increase to the same extent that the time resources we need to make them disappear.

That some folks go real fast and reap benefits while others feel bogged down in having to catch up can therefore partially be a sign that we haven’t properly handled synchronization and desynchronization. But it can also be a function of people having to deliberately slow down their work when its output either requires or provides the stability needed by the fast movers. Quick iterations at the background level—what is generally taken for granted as part of the ecosystem—further increase the need for acceleration from all participants.

In a mindset of acceleration, we will seek to speed up every step we can, through optimization, technological innovation, process improvements, economies of scale, and so on. This connects to Rosa’s entire thesis of acceleration feeding into itself.5 One of the points Rosa makes, among many, is that we need to see the need for acceleration and the resulting felt pressures (everything goes faster, keeping up is harder; therefore we need to do more as well) as a temporal structure, which shapes how systems work. So while technical innovation offers opportunities to speed things up (often driven by economic forces), these innovations transform how our social structures are organized (often through specialization), which in turn, through a heightened feeling of what can be accomplished and a sense that the world keeps going faster, provokes a need to speed things up further and fuels technological innovation. Here's the diagram provided in his book:

Three steps feed into each other: technical acceleration leads to an acceleration of social change, which leads to an acceleration of the pace of life, which leads to technical acceleration. While self-sustaining, each of these steps is also propelled by an external 'driver': the economic driver (time is money) fuels technical acceleration, functional differentiation (specialization) fuels social change, and the promise of acceleration (doing/experiencing more within your lifetime) fuels the acceleration of the pace of life.

We generally frame acceleration as an outcome of technological progress, but the idea here is that the acceleration of temporal structures is, on its own, a mechanism that shapes society (and, of course, our industry). Periods of acceleration also tend to come with multiple forms of resistance; while some are a bit of a reflex to try and keep things under control (rather than having to suffer more adaptive cycles), there are also useful forms of slowing down, those which can provide stability and lengthen horizons of other acceleration efforts.

Few tech companies have a good definition of what productivity means, but the drive to continually improve it is nevertheless real. Without a better understanding of how work happens, we’re likely to keep seeing new tech land on people’s work as haphazard slashing and boosting of random parts of random work loops, with wide variation in how people frame its impact. I think this overall dynamic can provide a useful explanation for why some people, despite being able to make certain tasks much faster, either don't feel more productive overall, or actually feel like they don't save time and it creates more work. It's hard to untangle which type of slowdown is being argued for at times, but one should be careful not to classify all demands for slowing down as useless Luddite grumblings.6 It might be more useful down the road to check whether you could be eroding your own foundations without a replacement.

What do I do with this?

A systems-thinking approach tends to require a focus on interactions over components. What the model proposed here does is bring a temporal dimension to these interactions. We may see tasks and activities done during work as components of how we produce software; the synchronization requirements and feedback pathways across these loops, and across the various people involved, provide a way to map out where they meet.

Ultimately even the loop model is a crude oversimplification. People are influenced by their context and influence their context back in a continuous manner that isn’t possible to constrain to well-defined tasks and sequences. Reality is messier. This model could be a tool to remind ourselves that no acceleration happens in isolation. Each effort contains the potential for desynchronization, and for a resulting reorganization of related loops. In some ways, the aim is not to find specific issues, but to find potential mismatches in pacing, which suggest challenges in adapting and keeping up.

The analytical stance adopted matters. Seeking to optimize tasks in isolation can sometimes yield positive local results, within a single loop, and occasionally at a wider scale. Looking across loops in all their tangled mess, however, is more likely to let you see what’s worth speeding up (or slowing down to speed other parts up!), where pitfalls may lie, and where the need for adjustments will ripple out and play itself out. Experimentation and ways to speed things up will always happen and will keep happening, unless something drastically changes in western society; experimenting with a better idea of what to look for in terms of consequences is not a bad idea.


1: While I have not yet published a summary or quotes from it in my notes section, it's definitely one of the books that I knew, from the moment I started reading it, would have a huge influence on how I frame stuff, and as I promised everyone around me who saw me reading it: I'm gonna be very annoying once I'm done with it. Well, here we are. Grab a copy of Social Acceleration: A New Theory of Modernity. Columbia University Press, 2013.

2: Original report, figure 50 is on p. 75.

3: This example isn't there to imply that the synchronization point is necessary, nor that it is the only one, only that it exists and has an impact. This is based on my experience, but I have also seen multiple synchronization points either in code review or in RFC reviews whenever work crosses silo boundaries across teams and projects become larger in organizational scope.

4: I suspect it can also be seen as a contributor to concepts such as technical debt, which could be framed as a decoupling between validating a solution and engineering its sustainability.

5: I believe this also connects to the Law of Stretched Systems in cognitive systems engineering, and might overall be one of these cases where multiple disciplines find similar but distinct framings for similar phenomena.

6: Since I'm mentioning Luddism, I need to do the mandatory reference to Brian Merchant's Blood in the Machine, which does a good job at reframing Luddism in its historical context as a workers' movement trying to protect their power over their own work at the first moments of the Industrial Revolution. Luddites did not systematically resist or damage all new automation technology, but particularly targeted the factory owners who offered poor working conditions while sparing the others.

Permalink

Erlang/OTP 28.3 Release

OTP 28.3

Erlang/OTP 28.3 is the second maintenance patch package for OTP 28, with mostly bug fixes as well as improvements.

POTENTIAL INCOMPATIBILITIES

  • Adjustment in ssh_file module allowing inclusion of Erlang/OTP license in test files containing keys.

HIGHLIGHTS

ssl

  • Support for MLKEM hybrid algorithms x25519mlkem768, secp384r1mlkem1024, secp256r1mlkem768 in TLS-1.3

ssl, public_key

  • Added support in public_key and ssl for post quantum algorithm SLH-DSA.

erts, kernel

  • Support for the socket options TCP_KEEPCNT, TCP_KEEPIDLE, and TCP_KEEPINTVL has been implemented for gen_tcp, as well as TCP_USER_TIMEOUT for both gen_tcp and socket.

OTP

  • Publish OpenVEX statements in https://erlang.org/download/vex/

    OpenVEX statements contain the same information as the OTP advisories, with the addition of vendor CVEs for which Erlang/OTP is not affected. This is important to silence vulnerability scanners that may claim Erlang/OTP to be vulnerable to vendor dependency projects, e.g., openssl.

    OpenVEX statements will be published in https://erlang.org/download/vex/ where there will be an OTP file per release, e.g., https://erlang.org/download/vex/otp-28.openvex.json.

    Erlang/OTP publishes OpenVEX statements for all supported releases, that is, as of today, OTP-26, OTP-27, and OTP-28.

    The source SBOM tooling (oss-review-toolkit) has been updated to produce the source SBOM in SPDX v2.3 format, and the source SBOM now links to the OpenVEX statements through a security external reference. This means that by simply analyzing the source SBOM, everyone can find the location of the OpenVEX statements and process them further.

For details about bugfixes and potential incompatibilities, see the Erlang 28.3 README.

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here:

Permalink

Lazier Binary Decision Diagrams (BDDs) for set-theoretic types

The Elixir team and the CNRS are working on a set-theoretic type system for Elixir which, simply put, is a type system powered by unions, intersections, and negations. As part of the implementation of said type system, we need an efficient way of representing said operations. This article discusses the existing approaches found in theory and practice, as well as the improvements we have introduced as part of Elixir v1.19.

This article covers the implementation details of the type system. You don’t need to understand these internals to use the type system, just as you don’t need to know virtual machine bytecodes or compiler passes to use a programming language. Our goal is to document our progress and provide guidance for future maintainers and implementers. Let’s get started.

DNFs - Disjunctive Normal Forms

A Disjunctive Normal Form (DNF) is a standardized way of expressing logical formulas using only disjunctions (unions) of conjunctions (intersections). In the context of set-theoretic type systems, DNFs provide a canonical representation for union and intersection types, represented respectively as or and and in Elixir.

In Elixir, we would represent those as lists of lists. Consider a type expression like (A and B) or (C and D). This is already in DNF: it’s a union of intersections, and it would be represented as [[A, B], [C, D]]. This means performing the union of two DNFs is a simple list concatenation:

def union(dnf1, dnf2), do: dnf1 ++ dnf2

However, more complex expressions like A and (B or C) need to be converted. Using distributive laws, this becomes (A and B) or (A and C), which is now in DNF. In other words, the intersection of DNFs is a Cartesian product:

def intersection(dnf1, dnf2) do
  for intersections1 <- dnf1,
      intersections2 <- dnf2 do
    intersections1 ++ intersections2
  end
end

The advantage of DNFs is their simple structure. Every type can be represented as a union of intersections, making operations like checking whether a type is empty simply a matter of verifying that every intersection in the union contains at least one empty component:

def empty?(dnf) do
  Enum.all?(dnf, fn intersections ->
    Enum.any?(intersections, &empty_component?/1)
  end)
end

On the other hand, the snippets above already help us build an intuition on the drawbacks of DNFs.

First, we have seen how intersections are Cartesian products, which can lead to exponential blow ups when performing the intersection of unions. For example, (A₁ or A₂) and (B₁ or B₂) and (C₁ or C₂) leads to (A₁ and B₁ and C₁) or (A₁ and B₁ and C₂) or (A₁ and B₂ and C₁) or ..., a union of 8 distinct intersections.

Furthermore, if we implement unions as simple list concatenations, those unions can end up with duplicated entries, which exacerbates the exponential blow up when we perform intersections of these unions. This forces us to aggressively remove duplicates in unions, making the union operation more complex and expensive than a plain concatenation.
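
For illustration, a deduplicating union could look like the following sketch (illustrative only, not the actual implementation): sorting each intersection gives it a canonical shape so identical intersections collapse, at the cost of extra work on every union.

def union(dnf1, dnf2) do
  # Sort each intersection so identical ones become structurally equal,
  # then drop duplicates. More expensive than a plain concatenation.
  Enum.uniq(Enum.map(dnf1, &Enum.sort/1) ++ Enum.map(dnf2, &Enum.sort/1))
end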

Despite their limitations, DNFs served us well and were the data structure used as part of Elixir v1.17 and v1.18. However, since Elixir v1.19 introduced type inference of anonymous functions, negations became more prevalent in the type system, making exponential growth more frequent. Let’s understand why.

Inferring anonymous functions

Imagine the following anonymous function:

fn
  %{full_name: full} -> "#{full}"
  %{first_name: first, last_name: last} -> "#{last}, #{first}"
end

We can say the first clause accepts any map with the key full_name. The second clause accepts any map with the keys first_name and last_name that does NOT have the key full_name (otherwise it would have matched the first clause). Therefore, the inferred type should be:

$ %{full_name: String.Chars.t()} -> String.t()
$ %{first_name: String.Chars.t(), last_name: String.Chars.t()} and not
    %{full_name: String.Chars.t()} -> String.t()

As you can see, in order to express this type, we need a negation (not). Or, more precisely, a difference since A and not B is the same as A - B.

Implementing negations/differences in DNFs is relatively straightforward. Instead of lists of lists, we now use lists of two-element tuples, where the first element is a list of positive types and the second is a list of negative types. For example, we previously said (A and B) or (C and D) would be represented as [[A, B], [C, D]]; now it will be represented as:

[{[A, B], []}, {[C, D], []}]

While (A and not B) or C or D is represented as:

[{[A], [B]}, {[C], []}, {[D], []}]
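
For intuition, an intersection over this richer representation could look like the following sketch (illustrative): it is still a Cartesian product, but it carries the positive and negative parts of each conjunction along separately.

def intersection(dnf1, dnf2) do
  # (P1 and not N1) and (P2 and not N2) merges the positive parts
  # together and the negative parts together
  for {pos1, neg1} <- dnf1,
      {pos2, neg2} <- dnf2 do
    {pos1 ++ pos2, neg1 ++ neg2}
  end
end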

The difference between two DNFs is implemented similarly to intersections, except we now need to perform the Cartesian product over the positive and negative parts of each conjunction. And given anonymous functions have differences, inferring the types of anonymous functions is now exponentially expensive, which caused some projects to take minutes to compile. Not good!

BDDs - Binary Decision Diagrams

Luckily, those exact issues are well documented in the literature and are addressed by Binary Decision Diagrams (BDDs), whose use for set-theoretic types was introduced by Alain Frisch (2004) and later recalled and expanded by Giuseppe Castagna (2016).

BDDs represent set-theoretic operations as an ordered tree. This requires us to provide an order, any order, across all types. Given all Elixir values have a total order, that’s quite straightforward. Furthermore, by ordering it, we can detect duplicates as we introduce nodes in the tree. The tree can have three distinct node types:

type bdd() = :top or :bottom or {type(), constrained :: bdd(), dual :: bdd()}

:top represents the top type (where the intersection type and :top returns type) and :bottom represents the bottom type (where the intersection type and :bottom returns :bottom). Non-leaf nodes are represented via a three-element tuple, where the first element is the type (what we have been calling A, B… so far), the second element is called in literature the constrained branch, and the third element is the dual branch.

In order to compute the actual type of a non-leaf node, we need to compute (type() and constrained()) or (not type() and dual()) (hence the names constrained and dual). Let’s see some examples.

The type A is represented as {A, :top, :bottom}. This is because, if we compute (A and :top) or (not A and :bottom), we get A or :bottom, which is equivalent to A.

The type not A is represented as {A, :bottom, :top}, and it gives us (A and :bottom) or (not A and :top), which yields :bottom or not A, which is equivalent to not A.
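
This swap of :top and :bottom generalizes into a sketch of negation over the whole structure (illustrative, not necessarily the actual internals): negating a node negates both of its branches, since not ((A and C) or (not A and D)) simplifies to (A and not C) or (not A and not D).

# Negation swaps the leaves; inner nodes keep their type and recurse
def negation(:top), do: :bottom
def negation(:bottom), do: :top

def negation({type, constrained, dual}),
  do: {type, negation(constrained), negation(dual)}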

The type A and B, assuming A < B according to a total order, is represented as {A, {B, :top, :bottom}, :bottom}. Expanding it node by node gives us:

(A and ((B and :top) or (not B and :bottom))) or (not A and :bottom)
(A and (B or :bottom)) or (not A and :bottom)
(A and B) or :bottom
(A and B)

While the difference A and not B is represented as {A, {B, :bottom, :top}, :bottom}, which we also expand node by node:

(A and ((B and :bottom) or (not B and :top))) or (not A and :bottom)
(A and (:bottom or not B)) or (not A and :bottom)
(A and not B) or :bottom
(A and not B)

Finally, the union A or B is implemented as {A, :top, {B, :top, :bottom}}. Let’s expand it:

(A and :top) or (not A and ((B and :top) or (not B and :bottom)))
(A and :top) or (not A and (B or :bottom))
A or (not A and B)
(A or not A) and (A or B)
:top and (A or B)
A or B

In other words, Binary Decision Diagrams allow us to represent unions, intersections, and differences efficiently, removing the exponential blow up. Guillaume Duboc implemented them as part of Elixir v1.19, addressing the bottlenecks introduced as part of the new type system features… but unfortunately BDDs introduced new slowdowns.

The issue with BDDs comes when applying unions to intersections and differences. Take the following type (A and B) or C. Since we need to preserve the order A < B < C, it would be represented as:

{A, {B, :top, {C, :top, :bottom}}, {C, :top, :bottom}}

which can be expanded as:

(A and ((B and :top) or (not B and ((C and :top) or (not C and :bottom))))) or (not A and ((C and :top) or (not C and :bottom)))
(A and (B or (not B and C))) or (not A and C)
(A and (B or C)) or (not A and C)
(A and B) or (A and C) or (not A and C)
(A and B) or C

As you can see, although the representation is correct, its expansion ends up generating too many disjunctions. And while we can simplify them back to (A and B) or C symbolically, doing such simplifications in practice is too expensive.

In other words, the BDD expansion grows exponentially in size on consecutive unions, which is particularly troublesome because we must expand the BDD every time we check for emptiness or subtyping.
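
To see where the duplication comes from, here is a sketch of how the union of these eager BDDs can be defined (illustrative, not the actual internals). When the roots differ, the entire other BDD is copied into both branches, which is exactly how C ended up appearing twice in the representation of (A and B) or C above.

def union(:top, _other), do: :top
def union(_other, :top), do: :top
def union(:bottom, bdd), do: bdd
def union(bdd, :bottom), do: bdd

# Same root: merge the branches pointwise
def union({a, c1, d1}, {a, c2, d2}),
  do: {a, union(c1, c2), union(d1, d2)}

# Different roots: the other BDD is duplicated into both branches
def union({a1, c1, d1}, {a2, _, _} = bdd2) when a1 < a2,
  do: {a1, union(c1, bdd2), union(d1, bdd2)}

def union({a1, _, _} = bdd1, {a2, c2, d2}) when a1 > a2,
  do: {a2, union(bdd1, c2), union(bdd1, d2)}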

At the end of the day, it seems we traded faster intersections/differences for slower unions. Perhaps we can have our cake and eat it too?

BDDs with lazy unions (or ternary decision diagrams)

Luckily, the issue above was also anticipated by Alain Frisch (2004), who suggested an additional representation, called BDDs with lazy unions.

In a nutshell, we introduce a new element, called uncertain, to each non-leaf node to represent unions:

type lazy_bdd() = :top or :bottom or
  {type(), constrained :: lazy_bdd(), uncertain :: lazy_bdd(), dual :: lazy_bdd()}

We’ll refer to the uncertain part as the union going forward.

The type of each non-leaf node can be computed by (type() and constrained()) or uncertain() or (not type() and dual()). Here are some examples:

A = {A, :top, :bottom, :bottom}
A and B = {A, {B, :top, :bottom, :bottom}, :bottom, :bottom}
A or B = {A, :top, {B, :top, :bottom, :bottom}, :bottom}

And, going back to (A and B) or C, it can be represented as:

{A, {B, :top, :bottom, :bottom}, {C, :top, :bottom, :bottom}, :bottom}
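
As a quick sanity check, expanding this node with the lazy formula gives:

(A and ((B and :top) or :bottom or (not B and :bottom))) or ((C and :top) or :bottom or (not C and :bottom)) or (not A and :bottom)
(A and (B or :bottom)) or (C or :bottom) or :bottom
(A and B) or C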

The duplication of C is fully removed. With our new representation in hand, the next step is to implement union, intersection, and difference of lazy BDDs, using the formulas found in literature and described below.

Assuming that a lazy BDD B is represented as {a, C, U, D}, and therefore B1 = {a1, C1, U1, D1} and B2 = {a2, C2, U2, D2}, the union of the lazy BDDs B1 or B2 can be computed as:

{a1, C1 or C2, U1 or U2, D1 or D2} when a1 == a2
{a1, C1, U1 or B2, D1} when a1 < a2
{a2, C2, B1 or U2, D2} when a1 > a2
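
Transcribed into a sketch (illustrative names, with the leaf cases added), the union looks like this:

def union(:top, _other), do: :top
def union(_other, :top), do: :top
def union(:bottom, bdd), do: bdd
def union(bdd, :bottom), do: bdd

# Same root type: merge all three branches pointwise
def union({a, c1, u1, d1}, {a, c2, u2, d2}),
  do: {a, union(c1, c2), union(u1, u2), union(d1, d2)}

# The smaller root stays on top; the other BDD is folded lazily into its union part
def union({a1, c1, u1, d1}, {a2, _, _, _} = bdd2) when a1 < a2,
  do: {a1, c1, union(u1, bdd2), d1}

def union({a1, _, _, _} = bdd1, {a2, c2, u2, d2}) when a1 > a2,
  do: {a2, c2, union(bdd1, u2), d2}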

The intersection B1 and B2 is:

{a1, (C1 or U1) and (C2 or U2), :bottom, (D1 or U1) and (D2 or U2)} when a1 == a2
{a1, C1 and B2, U1 and B2, D1 and B2} when a1 < a2
{a2, B1 and C2, B1 and U2, B1 and D2} when a1 > a2

The difference B1 and not B2 is:

{a1, (C1 or U1) and not (C2 or U2), :bottom, (D1 or U1) and not (D2 or U2)} when a1 == a2
{a1, (C1 or U1) and not B2, :bottom, (D1 or U1) and not B2} when a1 < a2
{a2, B1 and not (C2 or U2), :bottom, B1 and not (D2 or U2)} when a1 > a2

Guillaume Duboc first implemented lazy BDDs to represent our function types, addressing some of the bottlenecks introduced alongside BDDs. Afterwards, we attempted to convert all types to use lazy BDDs, hoping they would address the remaining bottlenecks, but that was not the case. There were still some projects that type checked instantaneously in Elixir v1.18 (which used DNFs) but took minutes on v1.19 release candidates, which could only point to large unions still being the root cause. However, weren’t lazy BDDs meant to address the issue with unions?

That was the question ringing in Guillaume’s head and in mine after an hours-long conversation, when we decided to call it a day. Unbeknownst to each other, we both continued working on the problem that night and the following morning. Separately, we were both able to spot the issue and converge on the same solution.

Lazier BDDs (for intersections)

If you carefully look at the formulas above, you can see that intersections and differences of equal nodes cause a distribution of unions. Here is the intersection:

{a1, (C1 or U1) and (C2 or U2), :bottom, (D1 or U1) and (D2 or U2)} when a1 == a2

Notice how U1 and U2 now appear in both the constrained and dual parts, and the whole union part of the node disappeared, now listed simply as :bottom.

In addition, considering the common case where C1 = C2 = :top and D1 = D2 = :bottom, the node above becomes {a1, :top, :bottom, U1 and U2}, which effectively moves the unions to the dual part. If you pay close attention, since the uncertain part is now :bottom, we have reverted back to the original BDD representation. Any further union on those nodes will behave exactly as in the non-lazy BDDs, which we know to be problematic.

In other words, certain operations on lazy BDDs cause unions to revert to the previous BDD representation. So it seems lazy BDDs are not lazy enough? Could we stop this from happening?

Guillaume and I arrived at a new formula using different approaches. Given Guillaume’s approach can also be used to optimize differences, that’s the one I will show below. In particular, we know the intersection of equal nodes is implemented as:

{a1, (C1 or U1) and (C2 or U2), :bottom, (D1 or U1) and (D2 or U2)} when a1 == a2

If we distribute the intersection in the constrained part, we get:

(C1 and C2) or (C1 and U2) or (U1 and C2) or (U1 and U2)

If we distribute the intersection in the dual part, we get:

(D1 and D2) or (D1 and U2) or (U1 and D2) or (U1 and U2)

We can clearly see both parts have U1 and U2, which we can then move to the union! Leaving us with:

{a1,
 (C1 and C2) or (C1 and U2) or (U1 and C2),
 (U1 and U2),
 (D1 and D2) or (D1 and U2) or (U1 and D2)} when a1 == a2

We can then factor out C1 in the constrained and D1 in the dual (or C2 and D2 respectively), resulting in:

{a1,
 (C1 and (C2 or U2)) or (U1 and C2),
 (U1 and U2),
 (D1 and (D2 or U2)) or (U1 and D2)} when a1 == a2

While this new formula requires more operations, if we consider the common case C1 = C2 = :top and D1 = D2 = :bottom, we now have {a1, :top, U1 and U2, :bottom}, with the unions perfectly preserved in the middle. We independently implemented this formula and noticed it addressed all remaining bottlenecks!
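
As a sketch, the equal-node intersection clause can be transcribed as follows (illustrative; it relies on union/2 and intersection/2 for the sub-results, with the other clauses unchanged):

def intersection({a, c1, u1, d1}, {a, c2, u2, d2}) do
  {a,
   # constrained: (C1 and (C2 or U2)) or (U1 and C2)
   union(intersection(c1, union(c2, u2)), intersection(u1, c2)),
   # union: U1 and U2, preserved lazily
   intersection(u1, u2),
   # dual: (D1 and (D2 or U2)) or (U1 and D2)
   union(intersection(d1, union(d2, u2)), intersection(u1, d2))}
end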

Lazier BDDs (for differences)

The issues we outlined above for intersections are even worse for differences. Let’s check the difference formula:

{a1, (C1 or U1) and not (C2 or U2), :bottom, (D1 or U1) and not (D2 or U2)} when a1 == a2
{a1, (C1 or U1) and not B2, :bottom, (D1 or U1) and not B2} when a1 < a2
{a2, B1 and not (C2 or U2), :bottom, B1 and not (D2 or U2)} when a1 > a2

As you can see, all cases shuffle the union nodes around and leave :bottom in the union part. But this time, we know how to improve it! Let’s start with a1 == a2. If we expand the difference in the constrained part, we get:

(C1 and not C2 and not U2) or (U1 and not C2 and not U2)

If we do the same in the dual part, we have:

(D1 and not D2 and not U2) or (U1 and not D2 and not U2)

Unfortunately, there are no shared union terms between the constrained and dual parts, unless C2 and D2 are :bottom. Therefore, instead of fully rewriting the difference of equal nodes, we add the following special case:

{a1, C1 and not U2, U1 and not U2, D1 and not U2}
when a1 == a2 and C2 == :bottom and D2 == :bottom

We can apply a similar optimization when a1 < a2. The current formula:

{a1, (C1 or U1) and not B2, :bottom, (D1 or U1) and not B2} when a1 < a2

The constrained part can be written as (C1 and not B2) or (U1 and not B2) and the dual part as (D1 and not B2) or (U1 and not B2). Given (U1 and not B2) is shared on both parts, we can also convert it to a union, resulting in:

{a1, C1 and not B2, U1 and not B2, D1 and not B2} when a1 < a2

Unfortunately, we can’t apply this when a1 > a2, as differences are asymmetric and do not distribute over unions on the right side. Therefore, the updated formula for difference is:

{a1, C1 and not U2, U1 and not U2, D1 and not U2} when a1 == a2 and C2 == :bottom and D2 == :bottom
{a1, (C1 or U1) and not (C2 or U2), :bottom, (D1 or U1) and not (D2 or U2)} when a1 == a2
{a1, C1 and not B2, U1 and not B2, D1 and not B2} when a1 < a2
{a2, B1 and not (C2 or U2), :bottom, B1 and not (D2 or U2)} when a1 > a2
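
In code, the two new special cases could be sketched as extra clauses tried before the general ones (illustrative transcription of the formulas above):

# a1 == a2, with both C2 and D2 being :bottom
def difference({a, c1, u1, d1}, {a, :bottom, u2, :bottom}),
  do: {a, difference(c1, u2), difference(u1, u2), difference(d1, u2)}

# a1 < a2: the shared (U1 and not B2) term stays in the union part
def difference({a1, c1, u1, d1}, {a2, _, _, _} = bdd2) when a1 < a2,
  do: {a1, difference(c1, bdd2), difference(u1, bdd2), difference(d1, bdd2)}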

With these new formulas, all new typing features in Elixir v1.19 perform efficiently and most projects now type check faster than in Elixir v1.18. We have also been able to use the rules above to derive additional optimizations for differences, such as when a1 == a2 and U2 == :bottom, which will be part of future releases. Hooray!

Acknowledgements

As there is an increasing interest in implementing set-theoretic types for other dynamic languages, we hope this article shines a brief light on the journey and advancements made by the research and Elixir teams when it comes to representing set-theoretic types.

The type system was made possible thanks to a partnership between CNRS and Remote. The development work is currently sponsored by Fresha and Tidewave.

Permalink

Elixir v1.19 released: enhanced type checking and up to 4x faster compilation for large projects

Elixir v1.19 brings further improvements to the type system and compilation times, allowing us to find more bugs, faster.

Type system improvements

This release improves the type system by adding type inference of anonymous functions and type checking of protocols. These enhancements seem simple on the surface but required us to go beyond existing literature by extending current theory and developing new techniques. We will outline the technical details in future articles. For now, let’s look at what’s new.

Type checking of protocol dispatch and implementations

This release adds type checking when dispatching and implementing protocols.

For example, string interpolation in Elixir uses the String.Chars protocol. If you pass a value that does not implement said protocol, Elixir will now emit a warning accordingly.

Here is an example passing a range, which cannot be converted into a string, to an interpolation:

defmodule Example do
  def my_code(first..last//step = range) do
    "hello #{range}"
  end
end

the above emits the following warning:

warning: incompatible value given to string interpolation:

    range

it has type:

    %Range{first: term(), last: term(), step: term()}

but expected a type that implements the String.Chars protocol, it must be one of:

    dynamic(
      %Date{} or %DateTime{} or %NaiveDateTime{} or %Time{} or %URI{} or %Version{} or
        %Version.Requirement{}
    ) or atom() or binary() or float() or integer() or list(term())
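
One way to address this particular warning (an illustrative fix, not something prescribed by the release) is to convert the value into a string yourself, for example with inspect/1:

defmodule Example do
  def my_code(%Range{} = range) do
    # inspect/1 returns a binary, which implements String.Chars
    "hello #{inspect(range)}"
  end
end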

Warnings are also emitted if you pass a data type that does not implement the Enumerable protocol as a generator to for-comprehensions:

defmodule Example do
  def my_code(%Date{} = date) do
    for(x <- date, do: x)
  end
end

will emit:

warning: incompatible value given to for-comprehension:

    x <- date

it has type:

    %Date{year: term(), month: term(), day: term(), calendar: term()}

but expected a type that implements the Enumerable protocol, it must be one of:

    dynamic(
      %Date.Range{} or %File.Stream{} or %GenEvent.Stream{} or %HashDict{} or %HashSet{} or
        %IO.Stream{} or %MapSet{} or %Range{} or %Stream{}
    ) or fun() or list(term()) or non_struct_map()

Type checking and inference of anonymous functions

Elixir v1.19 can now type infer and type check anonymous functions. Here is a trivial example:

defmodule Example do
  def run do
    fun = fn %{} -> :map end
    fun.("hello")
  end
end

The example above has an obvious typing violation, as the anonymous function expects a map but a string is given. With Elixir v1.19, the following warning is now printed:

    warning: incompatible types given on function application:

        fun.("hello")

    given types:

        binary()

    but function has type:

        (dynamic(map()) -> :map)

    typing violation found at:
    │
  6 │     fun.("hello")
    │        ~
    │
    └─ mod.exs:6:8: Example.run/0

Function captures, such as &String.to_integer/1, will also propagate their type as of Elixir v1.19, creating more opportunities for Elixir’s type system to catch bugs in our programs.

Acknowledgements

The type system was made possible thanks to a partnership between CNRS and Remote. The development work is currently sponsored by Fresha, Starfish*, and Dashbit.

Faster compile times in large projects

This release includes two compiler improvements that can lead up to 4x faster builds in large codebases.

While Elixir has always compiled the given files in a project or a dependency in parallel, the compiler would sometimes be unable to use all of the machine resources efficiently. This release addresses two common limitations, delivering performance improvements that scale with codebase size and available CPU cores.

Code loading bottlenecks

Prior to this release, Elixir would load modules as soon as they were defined. However, because the Erlang part of code loading happens within a single process (the code server), this would make it a bottleneck, reducing parallelization, especially on large projects.

This release makes it so modules are loaded lazily. This reduces the pressure on the code server and the amount of work during compilation, with reports of more than two times faster compilation for large projects. The benefits depend on the codebase size and the number of CPU cores available.

Implementation-wise, the parallel compiler already acts as a mechanism to resolve modules during compilation, so we built on that. By making sure the compiler controls both module compilation and module loading, it can also better guarantee deterministic builds.

There are two potential regressions with this approach. The first one happens if you spawn processes during compilation which invoke other modules defined within the same project. For example:

defmodule MyLib.SomeModule do
  list = [...]

  Task.async_stream(list, fn item ->
    MyLib.SomeOtherModule.do_something(item)
  end)
end

Because the spawned process is not visible to the compiler, it won’t be able to load MyLib.SomeOtherModule. You have two options: either use Kernel.ParallelCompiler.pmap/2 or explicitly call Code.ensure_compiled!(MyLib.SomeOtherModule) before spawning the process that uses said module.
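
For example, the second option could look like this (a sketch reusing the hypothetical module names from the snippet above):

defmodule MyLib.SomeModule do
  # Load the module explicitly, since the spawned processes are not
  # visible to the parallel compiler
  Code.ensure_compiled!(MyLib.SomeOtherModule)

  list = [1, 2, 3]

  Task.async_stream(list, fn item ->
    MyLib.SomeOtherModule.do_something(item)
  end)
  |> Enum.to_list()
end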

The second one is related to @on_load callbacks (typically used for NIFs) that invoke other modules defined within the same project. For example:

defmodule MyLib.SomeModule do
  @on_load :init

  def init do
    MyLib.AnotherModule.do_something()
  end

  def something_else do
    ...
  end
end

MyLib.SomeModule.something_else()

The reason this fails is that @on_load callbacks are invoked within the code server and therefore have limited ability to load additional modules. It is generally advisable to limit the invocation of external modules during @on_load callbacks but, in case it is strictly necessary, you can set @compile {:autoload, true} in the invoked module to address this issue in a forwards and backwards compatible manner.

Both snippets above could actually lead to non-deterministic compilation failures in the past, and as a result of these changes, compiling these cases is now deterministic.

Parallel compilation of dependencies

This release introduces a variable called MIX_OS_DEPS_COMPILE_PARTITION_COUNT, which instructs mix deps.compile to compile dependencies in parallel.

While fetching dependencies and compiling individual Elixir dependencies already happened in parallel, as outlined in the previous section, there were pathological cases where performance gains would be left on the table, such as when compiling dependencies with native code or dependencies where one or two large files would take most of the compilation time.

By setting MIX_OS_DEPS_COMPILE_PARTITION_COUNT to a number greater than 1, Mix will now compile multiple dependencies at the same time, using separate OS processes. Empirical testing shows that setting it to half of the number of cores on your machine is enough to maximize resource usage. The exact speed-up will depend on the number of dependencies and the number of machine cores, and some users reported up to 4x faster compilation times when using our release candidates. If you plan to enable it on CI or build servers, keep in mind it will most likely have a direct impact on memory usage too.

Erlang/OTP 28 support

Elixir v1.19 officially supports Erlang/OTP 28.1 and later. In order to support the new Erlang/OTP 28 representation for regular expressions, structs can now control how they are escaped into abstract syntax trees by defining a __escape__/1 callback.

On the other hand, the new representation for regular expressions in Erlang/OTP 28+ implies they can no longer be used as default values for struct fields. Therefore, this is not allowed:

defmodule Foo do
  defstruct regex: ~r/foo/
end

You can, however, still use regexes when initializing the structs themselves:

defmodule Foo do
  defstruct [:regex]

  def new do
    %Foo{regex: ~r/foo/}
  end
end

OpenChain certification

Elixir v1.19 is also our first release following OpenChain compliance, as previously announced. In a nutshell:

  • Elixir releases now include a Source SBoM in CycloneDX 1.6 or later and SPDX 2.3 or later formats.
  • Each release is attested along with the Source SBoM.

These additions offer greater transparency into the components and licenses of each release, supporting more rigorous supply chain requirements.

This work was performed by Jonatan Männchen and sponsored by the Erlang Ecosystem Foundation.

Summary

There are many other goodies in this release, such as improved option parsing, better debuggability and performance in ExUnit, the addition of mix help Mod, mix help Mod.fun, mix help Mod.fun/arity, and mix help app:package to make documentation accessible via shell for humans and agents, and much more. See the CHANGELOG for the complete release notes.

Happy coding!

Permalink

Ongoing Tradeoffs, and Incidents as Landmarks

One of the really valuable things you can get out of in-depth incident investigations is a better understanding of how work is actually done, as opposed to how we think work is done, or how it is specified. A solid approach for this is to get people back into what things felt like at the time, and interview them about their experience to learn what they were looking for and what was challenging. By taking a close look at how people deal with exceptional situations and how they translate goals into actions, you also get to learn a lot about what's really important in normal times.

Incidents disrupt. They do so in undeniable ways that more or less force organizations to look inwards and question themselves. The disruption is why they are good opportunities to study and change how we do things.

In daily work, we'll tend to frame things in terms of decisions: do I ship now or test more? Do I go at it slow to really learn how this works or do I try and get AI to slam through it and figure it out in more depth later? Do we cut scope or move the delivery date? Do I slow down my own work to speed up a peer who needs some help? Is this fast enough? Should I argue in favor of an optimization phase? Do I fix the flappy test from another team or rerun it and move on? Do I address the low urgency alert now even though it will create a major emergency later, or address the minor emergency already in front of me? As we look back into our incidents and construct explanations, we can shed more light on what goes on and what's important.

In this post, I want to argue in favor of an additional perspective, centered on considering incidents as landmarks useful for orienting yourself in a tradeoff space.

From Decisions to Continuous Tradeoffs

Once you look past mechanical failures and seek to highlight the challenges of normal work, you start to seek ways to make situations clearer, not just to prevent undesirable outcomes, but to make it easier to reach good ones too.

Over time, you may think that decisions get better or worse, or that some types shift and drift as you study an ever-evolving set of incidents. There are trends, patterns. It will feel like a moving target, where some things that were always fine start being a problem. Sometimes external pressures, outside of any employee's control, create challenges that emerge from situations related to previous ones, all of which makes incidents increasingly feel like natural consequences of having to make choices.

Put another way, you can see incidents as collections of events in which decisions happen. Within that perspective, learning from them means hoping for participants to get better at dealing with the ambiguity and making future decisions better. But rather than being collections of events in which decisions happen, it's worthwhile to instead consider incidents as windows letting you look at continuous tradeoffs.

By continuous tradeoffs, I mean something similar to this bit of an article Dr. Laura Maguire and I co-authored titled Navigating Tradeoffs in Software Failures:

Tradeoffs During Incidents Are Continuations of Past Tradeoffs
Multiple answers hinted at the incident being an outcome of existing patterns within the organization where they had happened, where communication or information flow may be incomplete or limited. Specifically, the ability of specific higher-ranking contributors who can routinely cross-pollinate siloed organizations is called as useful for such situations [...]
[...]
The ways similar tradeoffs were handled outside of incidents are revisited during the incidents. Ongoing events provide new information that wasn’t available before, and the informational boundaries that were in place before the outage became temporarily suspended to repair shared context.

A key point in this quote is that what happens before, during, and after an incident can all be projected as being part of the same problem space, but with varying amounts of information and uncertainty weighing on the organization. There are also goals, values, priorities, and all sorts of needs and limitations being balanced against each other.

When you set up your organization to ship software and run it, you do it in response to and in anticipation of these pressure gradients. You don’t want to move slow with full consensus on all decisions. You don’t want everyone to need to know what everybody else is doing. Maybe your system is big enough you couldn’t anyway. You adopt an organizational structure, processes, and select what information gets transmitted and how across the organization so people get what they need to do what is required. You give some people more control of the roadmap than others, you are willing to pay for some tools and not others, you will slow down for some fixes but live with other imperfections, you will hire or promote for some teams before others, you will set deadlines and push for some practices and discourage others, because as an organization, you think this makes you more effective and competitive.

When there’s a big incident happening and you find out you need half a dozen teams to fix things, what you see is a sudden shift in priorities. Normal work is suspended. Normal organizational structure is suspended. Normal communication patterns are suspended. Break glass situations mean you dust off irregular processes and expedite things you wouldn’t otherwise, on schedules you wouldn’t usually agree to.

From the perspective of decisions, it's possible the bad system behavior gets attributed to suboptimal choices, and that we'll know better in the future thanks to what we learn now that we've shaken up our structure for the incident. In the aftermath, people keep suspending regular work to investigate what happened, share lessons, and mess with the roadmap with action items outside of the regular process. Then you more or less go back to normal, but with new knowledge and follow-up items.

Acting on decisions creates a sort of focus on how people handle the situations. Looking at incidents like they're part of a continuous tradeoff space lets you focus on how context gives rise to the situations.

In this framing, the various goals, values, priorities, and pressures are constantly being communicated and balanced against each other, and create an environment that shapes what solutions and approaches we think are worth pursuing or ignoring. Incidents are new information. The need to temporarily re-structure the organization is a clue that your "steady state" (even if this term doesn't really apply) isn't perfect.

Likewise, from the perspective of continuous tradeoffs, it's also possible, and now easier, to see the "bad" system behavior as a normal outcome of how we've structured our organization.

The type of prioritizations, configurations, and strategic moves you make mean that some types of incidents are more likely than others. Choosing to build a multi-tenant system saves money from shared resources but reduces isolation between workload types, such that one customer can disrupt others. Going multi-cloud prevents some outages but comes with a tax in terms of having to develop or integrate services that you could just build around a single provider. Keeping your infrastructure team split from your product org and never talking to sales means they may not know about major shifts in workloads that might come soon (like a big marketing campaign, a planned influx of new heavy users, or new features that are more expensive to run) and will stress their reactive capacity and make work more interrupt-driven.

Reacting to incidents by patching things up and moving on might bring us back to business as usual, but it does not necessarily question whether we're on the right trajectory.

Incidents as Navigational Landmarks

Think of old explorer maps, or even treasure maps: they are likely inaccurate, full of unspecified areas, and focused mainly on features that would let someone else figure out how to get around. The key markers on them would be forks in some roads or waterways, and landmarks.

A map drawn by Samuel de Champlain in 1632, representing the Ottawa region, showing the route he took in a 1616 trip.

If you were to find yourself navigating with a map like this, the way you'd know you were heading in the right direction is by confirming your position against landmarks or features matching your itinerary, and the way you'd know you're not on the right path at all is by noticing features that aren't where you expect them, or that aren't there at all: you may have missed a turn if you suddenly encounter a ravine that wasn't on your planned path, or one you shouldn't have reached before first seeing a river.

The analogy I want to introduce is to think of the largely unpredictable solution space of tradeoffs as the poorly mapped territory, and of incidents as potential landmarks when finding your way. They let you know if you're going in a desired general direction, but also if you're entirely in the wrong spot compared to where you wanted to be. You always keep looking for them; on top of being point-in-time feedback mechanisms when they surprise you, they're also precious ongoing signals in an imprecise world.

Making tradeoffs implies that there are types of incidents you expect to see happening, and some you don't.

If you decide to ship prototypes early to validate their market fit, before having fully analyzed usage patterns and prepared scaling work, then your biggest customers trying them, causing slowdowns, and complaining about it is actually in line with your priorities. That should be a plausible outcome. If you decide to have a team ignore your usual design process (say, RFCs or ADRs that make sure it integrates with the rest of the system well) in order to ship faster, then you should be ready for issues arising from clashes there. If you emphasize following procedures and runbooks, you might expect documented cases to be easier to handle but the truly surprising ones to be relatively more challenging and disruptive since you did not train as much for coping with unknown situations.

All these elements might come to a head when a multitenant system gets heavy usage from a large customer trying out a new feature developed in isolation (and without runbooks), which then impacts other parts of the system, devolving into a broader outage while your team struggles to figure out how to respond. This juncture could be considered to be a perfect storm as much as it could be framed as a powder keg—which one we get is often decided based on the amount of information available (and acted on) at the time, with some significant influence from hindsight.

You can't be everywhere all at once in the tradeoff space, and you can't prevent all types of incidents all at once. Robustness in some places creates weaknesses in others. Adaptation lets you reconfigure as you go, but fostering that capacity to adapt requires anticipation and the means to do so.

Either the incidents and their internal dynamics are a confirmation of the path you've chosen and it's acceptable (even if regrettable), or it's a path you don't want to be on and you need to keep that in mind going forward.

Incidents as landmarks is one of the tools that lets you notice and evaluate whether you need to change priorities, or put your thumb on the scale another way. You can suspect that the position you’re in was an outcome of these priorities. You might want to correct not just your current position, but your overall navigational strategy. Note that an absence of incidents doesn't mean you’re doing well, just that there are no visible landmarks for now; if you still seek a landmark, maybe near misses and other indirect signs might help.

But to know how to orient yourself, you need more than local and narrow perspectives on what happened.

If your post-incident processes purely focus on technical elements and response, then they may structurally locate responsibility on technical elements and responders. The incidents as landmarks stance demands that your people setting strategy do not consider themselves to be outside of the incident space, but instead see themselves as indirect but relevant participants. We're not looking to shift accountability away, but to broaden our definition of what the system is.

You want to give them the opportunity to keep the pressure gradients behind goal conflicts, and the adaptations these produce, continually in scope for incident reviews.

One thing to be careful about here is that to find the landmarks and make them visible, you need to go beyond the surface of the incident. The best structures to look for are going to be stable; forests are better than trees, but geological features are even better.

What you'll want to do is keep looking for second stories, elements that do not simply explain a specific failure, but also influence everyday successes. They're elements that incidents give you opportunities to investigate, but that are in play all the time. They shape the work by their own existence, and they become the terrain that can both constrain and improve how your people make things happen.

When identifying contributing factors, the ones present whether things are going well or not are often the most useful for navigating tradeoff spaces.

What does orientation look like? Once you have identified some of these factors with systemic impact, you should expect the related intervention (if any is required, because you think the tradeoff should not be the same going forward) to also be at a system level.

Are you going to find ways to influence habits, tweak system feedback mechanisms, clarify goal conflicts, shift pressures or change capacity? Then maybe the landmarks are used for reorienting your org. But if the interventions get re-localized down to the same responders or as new pressures added on top of old ones (making things more complex to handle, rather than clarifying them), there are chances you are letting landmarks pass you by.

The Risks of Pushing for This Approach

The idea of using incidents as navigational landmarks can make sense if you like framing the organization as its own organism, a form of distributed cognition that makes its way through its ecosystem with varying amounts of self-awareness. There's a large distance between that abstract concept, and you, as an individual, running an investigation and writing a report, where even taking the time to investigate is subject to the same pressures and constraints as the rest of normal work.

As Richard Cook pointed out, the concept of human error can be considered useful for organizations looking to shield themselves from the liabilities of an incident: if someone can be blamed for events, then the organization does not need to change what it normally does. By finding a culprit, blame and human error act like a lightning rod that safely diverts consequences from the org’s structure itself.

In organizations where this happens, trying to openly question broad priorities and goal conflicts can mark you as a threat to these defence mechanisms. Post-incident processes are places where power dynamics are often in play and articulate themselves.

If you are to use incidents as landmarks, do it the way you would for any other incident investigation: frame all participants (including upper management) as people trying to do a good job in a challenging world, maintain blame awareness, try to find how the choices made sense at the time, let people tell their stories, seek to learn before fixing, and don't overload people with theory.

Maintaining the trust the people in your organization give you is your main priority in the long term, and sometimes, letting go of some learnings today to protect your ability to keep doing more later is the best decision to make.

Beyond personal risk, being able to establish incidents as landmarks and using them to steer an organization means that your findings become part of how priorities and goals are set and established. People may have vested interests in you not changing things that currently advantage them, or may try to co-opt your process and push for their own agendas. The incidents chosen for investigations and the type of observations allowed or emphasized by the organization will be of interest. Your work is also part of the landscape.

Permalink

Erlang/OTP 28.1 Release

OTP 28.1

Erlang/OTP 28.1 is the first maintenance patch package for OTP 28, with mostly bug fixes as well as improvements.

Potential incompatibilities:

  • The internal inet_dns_tsig and inet_res modules have been fixed so that TSIG verification uses the correct timestamp. In the process, two undocumented error code atoms have been corrected to notauth and notzone to adhere to the DNS RFCs. Code that relied on the previous incorrect values may have to be corrected.

HIGHLIGHTS

  • A User’s Guide to dbg is now available in the documentation.
  • Support for the post-quantum signature algorithm ML-DSA and the key exchange algorithm ML-KEM (in ssl, public_key, and crypto, if built and linked with OpenSSL 3.5).

For details about bugfixes and potential incompatibilities see the Erlang 28.1 README

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here:

Permalink

Interoperability in 2025: beyond the Erlang VM

The Erlang Virtual Machine has, historically, provided three main options for interoperability with other languages and ecosystems, with different degrees of isolation:

  • NIFs (Native Implemented Functions) integrate with third party code in the same memory space via C bindings. This translates to low overhead and best performance but it also means faulty code can bring the whole Virtual Machine down, bypassing some of Erlang’s fault-tolerance guarantees

  • Ports start a separate Operating System process to communicate with other languages through STDIN/STDOUT, guaranteeing process isolation. In a typical Erlang fashion, ports are fully evented, concurrent, and distributed (i.e. you can pass and communicate with ports across nodes)

  • Distributed nodes rely on Erlang's well-defined distribution and serialization protocol to communicate with other runtimes. Any language can implement said protocol and act as an Erlang node, giving you full node isolation between runtimes

Those mechanisms have led to multiple integrations between Elixir and other programming languages, such as Zig and Rust, and more recently C++, Python, and Swift, which we will explore here.

Furthermore, alternative implementations of the Erlang VM and Elixir have brought a fourth category of interoperability through portability: where your Elixir program runs in a completely different environment to leverage its native capabilities, libraries, and ecosystem, while maintaining Elixir’s syntax and semantics (either partially or fully). This opens up some exciting new possibilities and since this approach is still relatively uncharted territory, let’s dive into it first.

Portability

The AtomVM is a lightweight implementation of the Erlang VM that can run in constrained environments, such as microcontrollers with just a few hundred kilobytes of memory (ESP32, STM32, or Pico). AtomVM supports a functional subset of the Erlang VM and its standard library, all optimized to run on tiny microcontrollers.
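
To give a rough feel for the programming model, here is a minimal sketch of an AtomVM-style Elixir module; the start/0 entry point and :io.format/1 output reflect AtomVM's usual conventions, but the module name and details are illustrative assumptions, not taken from this article:

defmodule HelloAtom do
  # Illustrative sketch: AtomVM programs begin at a start/0 function in the
  # module packed into the .avm file; output goes through the Erlang-style
  # :io.format/1 provided by AtomVM's standard library subset.
  def start do
    :io.format(~c"Hello from AtomVM!~n")
    :ok
  end
end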

Given its low footprint, AtomVM can also target WebAssembly, paving the way to run Elixir in web browsers and alternative WASM runtimes in the future. The Popcorn project, recently announced at ElixirConf EU 2025, builds on those capabilities to provide better interoperability between Elixir and JavaScript.

Popcorn

Popcorn is a library for running Elixir in web browsers, with JavaScript interoperability. Popcorn brings an extensive subset of Elixir semantics into the browser and, although it is in its infancy, it is already capable of running interactive Elixir code entirely client side.

And here is a quick example showing how to communicate with JavaScript from WASM:

defmodule HelloPopcorn do
  use GenServer

  @process_name :main

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: @process_name)
  end

  @impl true
  def init(_init_arg) do
    Popcorn.Wasm.register(@process_name)
    IO.puts("Hello console!")

    Popcorn.Wasm.run_js("""
    () => {
      document.body.innerHTML = "Hello from WASM!";
    }
    """)

    :ignore
  end
end

Popcorn could help with Elixir adoption by making it really easy to create interactive guides with executable code right there in the browser. And once it’s production ready, it could enable offline, local-first applications, entirely in Elixir.

Hologram

Hologram is a full-stack isomorphic Elixir web framework that runs on top of Phoenix. It lets developers create dynamic, interactive web applications entirely in Elixir.

Hologram transpiles Elixir code to JavaScript and provides a complete framework including templates, components, routing, and client-server communication for building rich web applications.

Here is a snippet of a Hologram component that handles drawing events entirely client-side, taken from the official SVG Drawing Demo:

defmodule DrawingBoard do
  use Hologram.Component

  def init(_props, component, _server) do
    put_state(component, drawing?: false, path: "")
  end

  def template do
    ~HOLO"""
    <svg
      class="cursor-crosshair touch-none bg-black w-[75vw] h-[75vh]"
      $pointer_down="start_drawing"
      $pointer_move="draw_move"
      $pointer_up="stop_drawing"
      $pointer_cancel="stop_drawing"
    >
      <path d={@path} stroke="white" stroke-width="2" fill="none" />
    </svg>
    """
  end

  def action(:draw_move, params, component) when component.state.drawing? do
    new_path = component.state.path <> " L #{params.event.offset_x} #{params.event.offset_y}"
    put_state(component, :path, new_path)
  end

  def action(:start_drawing, params, component) do
    new_path = component.state.path <> " M #{params.event.offset_x} #{params.event.offset_y}"
    put_state(component, drawing?: true, path: new_path)
  end
end

While Popcorn runs on a lightweight implementation of the Erlang VM with all of its primitives, Hologram works directly on the Elixir syntax tree. They explore distinct paths for bringing Elixir to the browser and are both in active development.

Native Implemented Functions (NIFs)

NIFs allow us to write performance-critical or system-level code and call it directly from Erlang and Elixir as if it were a regular function.

NIFs solve practical problems like improving performance or using all Operating System capabilities. NIFs run in the same Operating System process as the VM, sharing the same memory space. With them we can use third-party native libraries, execute syscalls, interface with the hardware, etc. On the other hand, using them can forgo some of Erlang's stability and error handling guarantees.

Originally, NIFs could never block and had to be written in a “yielding” fashion, which limited their applicability. Since Erlang/OTP 17, however, NIFs can be scheduled to run on separate OS threads called “dirty schedulers”, based on their workloads (IO or CPU). This has directly brought Elixir and the Erlang VM into new domains, such as Numerical Elixir, and to interop with new languages and ecosystems.

C

Erlang’s NIF API directly targets the C programming language and is used to implement low-level functionality present in Erlang’s standard library:

#include <erl_nif.h>

static ERL_NIF_TERM add_int64_nif(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    int64_t a, b;
    if (!enif_get_int64(env, argv[0], &a) || !enif_get_int64(env, argv[1], &b)) {
        return enif_make_badarg(env);
    }
    return enif_make_int64(env, a + b);
}

static ErlNifFunc nif_funcs[] = {
    {"add", 2, add_int64_nif},
};

ERL_NIF_INIT("Elixir.Example", nif_funcs, NULL, NULL, NULL, NULL)

Writing NIFs in C can be verbose and error-prone. Fortunately, the Elixir ecosystem offers a number of high-quality libraries that make it possible to write NIFs in other languages; let’s check them out.

C++

Fine is a lightweight C++ library that wraps the NIF API with a modern interface. Given the widespread use of C++ in machine learning and data, Fine aims to reduce the friction of getting from Elixir to C++ and vice-versa.

Here’s the same NIF that adds two numbers in C++, using Fine:

#include <fine.hpp>

int64_t add(ErlNifEnv *env, int64_t a, int64_t b) {
  return a + b;
}

FINE_NIF(add, 0);
FINE_INIT("Elixir.Example");

Fine automatically encodes and decodes NIF arguments and return values based on the function signature, significantly reducing boilerplate code. It also has first-class support for Elixir structs, propagates C++ exceptions as Elixir exceptions, and more.

Rust

Rustler is a library for writing NIFs in Rust. The goal is to make it impossible to crash the VM when using “safe” Rust code. Furthermore, Rustler makes it easy to encode/decode Rust values to and from Elixir terms while safely and ergonomically managing resources.

Here’s an example NIF implemented with Rustler:

#[rustler::nif]
fn add(a: i64, b: i64) -> i64 {
  a + b
}

rustler::init!("Elixir.Example");
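
On the Elixir side, Rustler generates the loading glue for you; a typical module looks roughly like this sketch (the :my_app application and "example" crate names are assumptions):

defmodule Example do
  # `use Rustler` compiles the Rust crate and loads the resulting NIF;
  # the otp_app and crate values here are illustrative.
  use Rustler, otp_app: :my_app, crate: "example"

  # Fallback stub, replaced by the native implementation once loaded.
  def add(_a, _b), do: :erlang.nif_error(:nif_not_loaded)
end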

Zig

Zigler lets us write NIFs in Zig, a low-level programming language designed for maintaining robust, optimal, and reusable software. Zig removes hidden control flow, implicit memory allocation, and similar abstractions in favour of code that’s explicit and predictable.

Zigler compiles Zig code at build time and exposes it directly to Elixir, without external build scripts or glue. It tightly integrates with Elixir tooling: Zig code is formatted via mix format and documentation written in Zig appears in IEx via the h helper.

Here’s an example NIF in Zig:

iex> Mix.install([:zigler])
iex> defmodule Example do
       use Zig, otp_app: :zigler

       ~Z"""
       pub fn add(a: i64, b: i64) i64 {
         return a + b;
       }
       """
     end
iex> Example.add(1, 2)
3

We can write NIFs directly in IEx sessions, scripts, Livebook notebooks, and similar! And with Zig’s excellent interop with C, it’s really easy to experiment with native code on the Erlang VM.

Python

Pythonx runs a Python interpreter in the same OS process as your Elixir application, allowing you to evaluate Python code and conveniently convert between Python and Elixir data structures. Pythonx also integrates with the uv package manager, automating the management of Python and its dependencies.

One caveat is that Python’s Global Interpreter Lock (GIL) prevents multiple threads from executing Python code at the same time, so calling Pythonx from multiple Elixir processes does not provide the concurrency we might expect and can become a source of bottlenecks. However, the GIL is a constraint for regular Python code only. Packages with CPU-intensive functionality, such as numpy, have native implementations of many functions, and invoking those releases the GIL (the GIL is also released when waiting on I/O).

Here’s an example of using numpy in Elixir:

iex> Mix.install([{:pythonx, "~> 0.4.0"}])
iex> Pythonx.uv_init("""
     [project]
     name = "myapp"
     version = "0.0.0"
     requires-python = "==3.13.*"
     dependencies = [
       "numpy==2.2.2"
     ]
     """)
iex> import Pythonx, only: :sigils
iex> x = 1
iex> ~PY"""
     import numpy as np

     a = np.int64(x)
     b = np.int64(2)
     a + b
     """
#Pythonx.Object<
  np.int64(3)
>

Livebook uses Pythonx to allow Elixir and Python code cells to co-exist in the same notebook (and in the same memory space), with low-overhead when transferring data between them.

Distributed nodes

Elixir, by way of Erlang, has built-in support for distributed systems. Multiple nodes can connect over a network and communicate using message passing, with the same primitives such as send and receive used for both local and remote processes.

Nodes become discoverable in the cluster simply by starting them with names. Once we connect to a node, we can send messages, spawn remote processes, and more. Here’s an example:

$ iex --name a@127.0.0.1 --cookie secret
$ iex --name b@127.0.0.1 --cookie secret
iex(a@127.0.0.1)> Node.connect(:"b@127.0.0.1")
iex(a@127.0.0.1)> node()
:"a@127.0.0.1"
iex(a@127.0.0.1)> :erpc.call(:"b@127.0.0.1", fn -> node() end)
:"b@127.0.0.1"

While Distributed Erlang is typically used for Erlang-Erlang communication, it can be also used for interacting with programs written in other programming languages. Erlang/OTP includes Erl_Interface, a C library for writing programs that can participate in the Erlang cluster. Such programs are commonly called C nodes.
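
From the Elixir side, a C node is addressed like any other node in the cluster. The sketch below assumes the C node has registered a process under the name :server on a node called :"cnode@127.0.0.1"; both names and the message shape are illustrative, not tied to any particular library:

# Send a request to a named process on the C node and wait for its reply.
send({:server, :"cnode@127.0.0.1"}, {:request, self(), {:add, 1, 2}})

receive do
  {:reply, result} -> result
after
  5_000 -> {:error, :timeout}
end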

Any language may implement these protocols from scratch or, alternatively, use erl_interface as a building block. For example, Erlang/OTP ships with the Jinterface application, a Java library that lets JVM programs act as distributed Erlang nodes. Another recent example is the Swift Erlang Actor System, for communicating between Swift and Erlang/Elixir programs.

Ports

Last but not least, ports are the basic mechanism that Elixir/Erlang uses to communicate with the outside world. Ports are the most common mechanism for interoperability across programming languages, so we will only provide two brief examples.

In Elixir, the Port module offers a low-level API to start separate programs. Here’s an example that runs uname -s to print the current operating system:

iex> port = Port.open({:spawn, "uname -s"}, [:binary])
iex> flush()
{#Port<0.3>, {:data, "Darwin\n"}}
iex> send(port, {self(), :close})
iex> flush()
{#Port<0.3>, :closed}
:ok

Most times, however, developers use System.cmd/3 to invoke short-running programs:

iex> System.cmd("uname", ["-s"])
{"Darwin\n", 0}

Summary

This article highlights several of the many options for interoperating with Elixir and the Erlang Virtual Machine. While it does not aim to be a complete reference, it covers integration across a range of languages, such as Rust, Zig, Python, and Swift, as well as portability to different environments, including microcontrollers and web browsers.

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.