Debugging a Slow Starting Elixir Application

I recently had to fix an Elixir service that was slow to start. I was able to pinpoint the issue with only a few commands and I want to share a couple of the things I learned.

Dependencies

In Elixir all dependencies are “applications”. The term “application” means something different than it does outside of Elixir. In Elixir an “application” is a set of modules and behaviors. Some of these applications define their own supervision trees and must be started by Application.start/2 before they can be used. When you start your Elixir service, either via Mix or a generated Elixir release, the dependencies you specified in your mix.exs file are started before your own code. If an application listed as a dependency is slow to start, your application must wait until that dependency is running before it can start.

While the behavior is simple, it is recursive. Each application has its own set of dependencies that must be running before that application can be started, and some of those dependencies have dependencies of their own that must be running before they can start. This results in a dependency tree structure. To illustrate this with a little ASCII:

- your_app
  - dependency_1
    - hidden_dependency_1
    - hidden_dependency_2
  - dependency_2
    - hidden_dependency_3

For this application, the Erlang VM would likely start these applications in this order:

  1. hidden_dependency_3

  2. dependency_2

  3. hidden_dependency_2

  4. hidden_dependency_1

  5. dependency_1

  6. your_app

The application I had to fix had a lot of dependencies. Profiling each application would be tedious and time-consuming, and I had a hunch there was probably a single dependency that was the problem. It turns out it’s pretty easy to write a little code that times the startup of each application.

Profiling

Start an IEx shell with the --no-start flag so that the application is available but not yet loaded or started:

iex -S mix run --no-start

Then load this code into the shell:

defmodule StartupBenchmark do
  def run(application) do
    complete_deps = deps_list(application) # (1)

    dep_start_times = Enum.map(complete_deps, fn(app) -> # (2)
      case :timer.tc(fn() -> Application.start(app) end) do
        {time, :ok} -> {time, app}
        # Some dependencies like :kernel may have already been started, we can ignore them
        {time, {:error, {:already_started, _}}} -> {time, app}
        # Raise an exception if we get a non-successful return value
        {time, error} -> raise "could not start #{inspect(app)}: #{inspect(error)}"
      end
    end)

    dep_start_times
    |> Enum.sort() # (3)
    |> Enum.reverse()
  end

  defp deps_list(app) do
    # Get all dependencies for the app
    deps = Application.spec(app, :applications)

    # Recursively call to get all sub-dependencies
    complete_deps = Enum.map(deps, fn(dep) -> deps_list(dep) end)

    # Build a complete list of sub dependencies, with the top level application
    # requiring them listed last, also remove any duplicates
    [complete_deps, [app]]
    |> List.flatten()
    |> Enum.uniq()
  end
end

To highlight the important pieces from this module:

  1. Recursively get all applications that must be started in the order they need to be started in.

  2. Start each application in order; timing each one.

  3. Sort applications by start time so the slowest application is the first item in the list.

With this code, finding applications that are slow to start is easy:

> StartupBenchmark.run(:your_app)
[
  {6651969, :prometheus_ecto},
  {19621, :plug_cowboy},
  {14336, :postgrex},
  {13598, :ecto_sql},
  {5123, :yaml_elixir},
  {3871, :phoenix_live_dashboard},
  {1159, :phoenix_ecto},
  {123, :prometheus_plugs},
  {64, :ex_json_logger},
  {56, :prometheus_phoenix},
  {56, :ex_ops},
  {36, :kernel},
  ...
]

These times are in microseconds, so in this case prometheus_ecto is taking 6.6 seconds to start. All other applications are taking less than 20 milliseconds to start, and many of them are taking less than 1 millisecond. prometheus_ecto is the culprit here.

Conclusion

With the code above I was able to identify prometheus_ecto as the problem. With this information I was then able to use eFlambe and a few other tools to figure out why prometheus_ecto was so slow and quickly fix the issue.

I hope the snippet of code above will be helpful to some of you. If you like reading my blog posts please subscribe to my newsletter. I send emails out once a month with my latest posts.

My favorite Erlang Container

2022/07/09

Joe Armstrong wrote a blog post titled My favorite Erlang Program, which showed a very simple universal server written in Erlang:

universal_server() ->
    receive
        {become, F} ->
            F()
    end.

You could then write a small program that could be slotted in as that function F:

factorial_server() ->
    receive
        {From, N} ->
            From ! factorial(N),
            factorial_server()
    end.

factorial(0) -> 1;
factorial(N) -> N * factorial(N-1).

If you had an already running universal server, such as you would by having called Pid = spawn(fun universal_server/0), you could then turn that universal server into the factorial server by calling Pid ! {become, fun factorial_server/0}.
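
Putting the two together (a sketch, assuming both functions are compiled into a module I'm arbitrarily calling universal), a shell session could look like this:

1> Pid = spawn(fun universal:universal_server/0).
2> Pid ! {become, fun universal:factorial_server/0}.
3> Pid ! {self(), 10}.
4> flush().
Shell got 3628800
ok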

Weeds growing in driveway cracks

Joe Armstrong had a way to get to the essence of a lot of concepts and to think about programming as a fun thing. Unfortunately for me, my experience with the software industry has left me more or less frustrated with the way things are, even if the way things are is for very good reasons. I really enjoyed programming Erlang professionally, but I eventually got sidetracked by the other factors that lead to solid, safe software—mostly higher-level aspects of socio-technical systems—and I became an SRE.

But a part of me still really likes dreaming about the days where I could do hot code loading over entire clusters—see A Pipeline Made of Airbags—and I kept thinking about how I could bring this back, but within the context of complex software teams running CI/CD, building containers, and running them in Kubernetes. This is a tall order, because we now have decades of lessons telling everyone that you want your infrastructure to be immutable and declared in code.

I also have a decade of experience telling me a lot of what we've built is a frustrating tower of abstractions over shaky ground. I know I've experienced better, my day job is no longer about slinging my own code, and I have no pretense of respecting the tower of abstraction itself.

A weed is a plant considered undesirable in a particular situation, "a plant in the wrong place". Examples commonly are plants unwanted in human-controlled settings, such as farm fields, gardens, lawns, and parks. Taxonomically, the term "weed" has no botanical significance, because a plant that is a weed in one context is not a weed when growing in a situation where it is wanted.

Like the weeds that decide that the tiniest crack in a driveway is actually real cool soil to grow in, I've decided to do the best thing given the situation and bring back Erlang live code upgrades to modern CI/CD, containerized and kubernetized infrastructure.

If you want the TL;DR: I wrote the dandelion project, which shows how to take an Erlang/OTP app, automate the generation of live code upgrade instructions with the help of pre-existing tools and CI/CD, generate a manifest file and store build artifacts, and write the necessary configuration to have Kubernetes run the resulting containers and do automated live code upgrades despite its best attempts at providing immutable images. Then I pepper in some basic CI scaffolding to make live code upgrading a tiny bit less risky. This post describes how it all works.

A Sample App

A lot of "no downtime" deployments you'll find for Kubernetes are actually just rolling updates with graceful connection termination. Those are always worth supporting (even if they can be annoying to get right), but they rely on a narrower definition of downtime than what we're aiming for here: no need to restart the application, to dump and re-hydrate state, nor to drop a single connection.

A small server with persistent connections

I wrote a really trivial application, nothing worth writing home about. It's a tiny TCP server with a bunch of acceptors where you can connect with netcat (nc localhost 8080) and it just displays stuff. This is the bare minimum to show actual "no downtime": a single instance, changing its code definitions in a way that is directly observable to a client, without ever dropping a connection.

The application follows a standard release structure for Erlang, using Rebar3 as a build tool. Its supervision structure looks like this:

               top level
              supervisor
               /      \
        connection   acceptors
        supervisor   supervisor
            |            |
        connection    acceptor
         workers        pool
       (1 per conn)

The acceptors supervisor starts a TCP listen socket, which is passed to each worker in the acceptor pool. Upon each accepted connection, a connection worker is started and handed the final socket. The connection worker then sits in an event loop: every second it sends a small ASCII drawing of a dandelion, and for every packet of data it receives (coalesced or otherwise), it sends back a line containing its version.
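
As a rough sketch of what that worker's event loop could look like (this is not the actual Dandelion source; the module name is made up and it assumes the socket is in active mode):

-module(dandelion_conn_sketch).
-export([loop/1]).

loop(Socket) ->
    receive
        {tcp, Socket, _Data} ->
            %% any received packet is answered with the running version
            {ok, Vsn} = application:get_key(dandelion, vsn),
            gen_tcp:send(Socket, ["vsn: ", Vsn, "\n"]),
            loop(Socket);
        {tcp_closed, Socket} ->
            ok
    after 1000 ->
        %% every second, send the small ASCII dandelion
        gen_tcp:send(Socket, <<"   @\n \\ |\n__\\!/__\n\n">>),
        loop(Socket)
    end.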

A netcat session looks like this:

$ nc localhost 8080

   @
 \ |
__\!/__


   @
 \ |
__\!/__

> ping!
vsn: 0.1.5

   @
 \ |
__\!/__

^C

Universal container: a plan

Modern deployments are often done with containers and Kubernetes. I'm assuming you're familiar with both of these concepts, but if you want more information—in the context of Erlang—then Tristan Sloughter wrote a great article on Docker and Erlang and another one on using Kubernetes for production apps.

In this post, I'm interested in doing two things:

  1. Have an equivalent to Joe Armstrong's Universal server
  2. Force the immutable world to nevertheless let me do live code upgrades

The trick here is deceptively simple, enough to think "that can't be a good idea." It probably isn't.

A tracing of Nathan Fielder

The plan? Use a regular container someone else maintains and just wedge my program's tarball in there. I can then use a sidecar to automate fetching updates and applying live code upgrades without Kubernetes knowing anything about it.

Erlang Releases: a detour

To understand how this can work, we first need to cover the basics of Erlang releases. A good overview of the Erlang Virtual Machine's structure is in an article I've written, OTP at a high level, but I can summarize it here by describing the following layers, from lowest to highest:

  • There is an Erlang Runtime System (ERTS), which is essentially the VM itself, written in C. This can't be live upgraded, and offers features around Erlang's immutability, preemptive scheduling, memory allocation, garbage collection, and so on.
  • There are a few pre-loaded modules that offer core functionality around files and sockets. I don't believe these get to be live upgraded either.
  • There is a pair of libraries ("applications" in Erlang-speak) called kernel and stdlib that define the most basic libraries around list processing, distributed programs, and the core of "OTP", the general development framework in the language. These can be live upgraded, but pretty much nobody does that.
  • Then we have the Erlang standard library. This includes things such as TLS support, HTTP clients and servers, Wx bindings, test frameworks, the compiler, and extra scaffolding around OTP niceties, to name a few.

Your own project is pretty much just your own applications ("libraries"), bundled with a selection of standard library applications and a copy of the Erlang system:

release schematic drawing

The end result of this sort of ordeal is that every Erlang project is pretty much people writing their own libraries (in blue in the drawing above), fetching bits from the Erlang base install (in red in the drawing above), and then using tools (such as Rebar3) to repackage everything into a brand new Erlang distribution. A detailed explanation of how this happens is also in Adopting Erlang.

Whatever you built on one system can be deployed on an equivalent system. If you built your app on a 64-bit Linux—and assuming you used static libraries for OpenSSL or LibreSSL, or have equivalent ones on a target system—then you can pack a tarball, unpack it on the other host system, and get going. Those requirements don't apply to your code, only to the standard library. If you don't use NIFs or other C extensions, your own Erlang code, once built, is fully portable.

The cool thing is that Erlang supports making a sort of "partial" release, where you take the Erlang/OTP part (red) and your own apps (blue) in the above image, package only your own apps plus a sort of "figure out the Erlang/OTP part at run-time" instruction, and your application is going to be entirely portable across all platforms (Windows, Linux, BSD, macOS) and supported architectures (x86, ARM32, ARM64, etc.).

I'm mentioning this because for the sake of this experiment, I'm running things locally on an M1 MacBook Air (ARM64), with MicroK8s (which runs Linux/aarch64), but am using GitHub Actions, which run on x86 Linux. So rather than using a base Ubuntu image and then needing to run the same sort of hardware family everywhere down the build chain, I'll be using an Erlang image from Docker Hub to provide the ERTS and stdlib, and will then have the ability to make portable builds from either my laptop or GitHub Actions and deploy them onto any Kubernetes cluster—something noticeably nicer than having to deal with cross-compilation in any language.

Controlling Releases

The release definition for Dandelion accounts for the above factors, and looks like this:

{relx, [
    {release, {dandelion, "0.1.5"},  % my release and its version (dandelion-0.1.5)
     [dandelion,                     % includes the 'dandelion' app and its deps
      sasl]},                        % and the 'sasl' library, which is needed for
                                     % live code upgrades to work

    %% set runtime configuration values based on the environment in this file
    {sys_config_src, "./config/sys.config.src"},
    %% set VM options in this file
    {vm_args, "./config/vm.args"},

    %% drop source files and the ERTS, but keep debug annotations
    %% which are useful for various tools, including automated
    %% live code upgrade plugins
    {include_src, false},
    {include_erts, false},
    {debug_info, keep},
    {dev_mode, false}
]}.

The release can be built by calling rebar3 release and packaged by calling rebar3 tar.

Take the resulting tarball, unpack it, and you'll get a bunch of directories: lib/ contains the build artifact for all libraries, releases/ contains metadata about the current release version (and the structure to store future and past versions when doing live code upgrades), and finally the bin/ directory contains a bunch of accessory scripts to load and run the final code.

Call bin/dandelion and a bunch of options show up:

$ bin/dandelion
Usage: dandelion [COMMAND] [ARGS]

Commands:

  foreground              Start release with output to stdout
  remote_console          Connect remote shell to running node
  rpc [Mod [Fun [Args]]]    Run apply(Mod, Fun, Args) on running node
  eval [Exprs]            Run expressions on running node
  stop                    Stop the running node
  restart                 Restart the applications but not the VM
  reboot                  Reboot the entire VM
...
  upgrade [Version]       Upgrade the running release to a new version
  downgrade [Version]     Downgrade the running release to a new version
  install [Version]       Install a release
  uninstall [Version]     Uninstall a release
  unpack [Version]        Unpack a release tarball
  versions                Print versions of the release available
...

So in short, your program's lifecycle can become:

  • bin/dandelion foreground boots the app (use bin/dandelion console to boot a version in a REPL)
  • bin/dandelion remote_console pops up a REPL onto the running app by using distributed Erlang

If you're doing the usual immutable infrastructure, that's it, you don't need much more. If you're doing live code upgrades, you then have a few extra steps:

  1. Write a new version of the app
  2. Give it some instructions about how to do its live code upgrade
  3. Pack that in a new version of the release
  4. Put the tarball in releases/
  5. Call bin/dandelion unpack <version> and the Erlang VM will unpack the new tarball into its regular structure
  6. Call bin/dandelion install <version> to get the Erlang VM in your release to start tracking the new version (without switching to it)
  7. Call bin/dandelion upgrade <version> to apply the live code upgrade

And from that point on, the new release version is live.

Hot Code Upgrade Instructions

I've sort of papered over the complexity required to "give it some instructions about how to do its live code upgrade." This area is generally really annoying and complex. You first start with appup files, which contain instructions on upgrading individual libraries, which are then packaged into a relup which provides instructions for coordinating the overall upgrade.
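
For a sense of what that looks like, an appup is just an Erlang term file mapping versions to instructions. A hypothetical ebin/dandelion.appup (the module names here are invented for the example) could contain:

%% {NewVsn, [{UpFromVsn, UpInstructions}], [{DownToVsn, DownInstructions}]}
{"0.1.6",
 %% upgrading from 0.1.5: reload one module, run code_change on another
 [{"0.1.5", [{load_module, dandelion_conn},
             {update, dandelion_acceptor, {advanced, []}}]}],
 %% downgrading back to 0.1.5 uses the mirror instructions
 [{"0.1.5", [{load_module, dandelion_conn},
             {update, dandelion_acceptor, {advanced, []}}]}]}.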

If you're running live code upgrades on a frequent basis you may want to get familiar with these, but most people never bothered, and the vast majority of live code upgrades are done by people writing manual scripts to load specific modules.

A very nice alternative is Luis Rascão's rebar3_appup_plugin, which will take two releases, compare their code, and auto-generate the instructions on your behalf. It covers most of the annoyances and challenges for you automatically.

All you need to do is make sure all versions are adequately bumped, do a few command line invocations, and package it up. This will be a prime candidate for automation later in this post.

For now though, let's assume we'll just put the release in an S3 bucket that the Kubernetes cluster has access to, and build our infrastructure on the Kubernetes side.

Universal container: a kubernetes story

Let's escape the Erlang complexity and don our DevOps hat. We now want to run the code we assume has made it safely to S3. All of it holds beautifully in a single YAML file—which, granted, can't really be beautiful on its own. I use three containers in a single Kubernetes pod:

containers schematic drawing

All of these containers will share a 'release' directory, by using an emptyDir volume. The bootstrap container will fetch the latest release and unpack it there, the dandelion-release container will run it, and the sidecar will interact with it over the network to manage live code upgrades.

The bootstrap container runs first, and fetches the first (and therefore current) release from S3. I'm doing so by assuming we'll have a manifest file (<my-s3-bucket>/dandelion-latest) that contains a single version number pointing to the tarball I want (<my-s3-bucket>/dandelion-<version>.tar.gz). This can be done with a shell script:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${1:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "/tmp/${RELEASE}-${TAG}.tar.gz"
tar -xvf "/tmp/${RELEASE}-${TAG}.tar.gz" -C ${RELDIR}
rm "/tmp/${RELEASE}-${TAG}.tar.gz"

This fetches the manifest, grabs the tag, fetches the release, unpacks it, and deletes the downloaded tarball. The dandelion-release container, which will run our main app, can then just call the bin/dandelion script directly:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${1:-/release}
exec ${RELDIR}/bin/${RELEASE} foreground

The sidecar is a bit trickier, but it can reuse the same mechanisms: at every time interval (or based on a feature flag or some server-sent signal), check the manifest and apply the unpacking steps. Something a bit like:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${2:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
    ${RELDIR}/bin/${RELEASE} unpack ${TAG}
    ${RELDIR}/bin/${RELEASE} install ${TAG}
    ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
fi

Call this in a loop and you're good to go.

Now here's the fun bit: ConfigMaps are a Kubernetes thing that lets you take arbitrary data and optionally mount it as files inside pods. This is how we get close to our universal container.

By declaring the three scripts above as a ConfigMap and mounting them in a /scripts directory, we can then declare the three containers in a generic fashion:

initContainers:
  - name: dandelion-bootstrap
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/init-latest.sh
# Regular containers run next
containers:
  - name: dandelion-release
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/boot-release.sh
    ports:
      - containerPort: 8080
        hostPort: 8080
  - name: dandelion-sidecar
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/update-loop.sh

The full file has more details, but this is essentially all we need. You could kubectl apply -f dandelion.yaml and it would get going on its own. The rest is about providing a better developer experience.

Making it Usable

What we have defined now is an expected format and procedure on Erlang's side to generate code and live upgrade instructions, and a wedge to make this workable within Kubernetes' own structure. The procedure is somewhat messy, and a lot of technical aspects need to be coordinated to make it usable.

Now comes the time to build a useful workflow around all of this.

Introducing Smoothver

Semver's alright. Most of the time I won't really care about it, though. I'll go read the changelog and see if whatever I depend on has changed or not. People will pick versions for whichever factor they want, and they'll very often put a small breaking change (maybe a bug fix!) as non-breaking because there's an intent being communicated by the version.

Here the semver semantics are not useful. I've just defined a workflow that mostly depends on whether the server can be upgraded live or not, with some minor variations. This operational concern is likely to be the main concern of engineers who would work on such an application daily, particularly since as someone deploying and maintaining server-side software, I mostly own the whole pipeline and always consider the main branch to be canonical.

As such, I should feel free to develop my own versioning scheme. Since I'm trying to reorient Dandelion's whole flow towards continuous live delivery, my versioning scheme should actually reflect and support that effort. I therefore introduce Smoothver (Smooth Versioning):

Given a version number RESTART.RELUP.RELOAD, increment the:

  • RESTART version when you make a change that requires the server to be rebooted.
  • RELUP version when you make a change that requires pausing workers and migrating state.
  • RELOAD version when you make a change that requires reloading modules with no other transformation.

The version number now communicates the relative expected risk of a deployment in terms of disruptiveness, carries some meaning around the magnitude of change taking place, and can be leveraged by tooling.

For example:

  • Fixing a bug in a data structure is a RELOAD deploy so long as the internal representation does not change (e.g. swapping a > for a >= in a comparison)
  • Adding a new endpoint or route to an existing HTTP API is likely a RELOAD deploy since no existing state relies on it. Rolling it back is a business concern, not a technical one.
  • Adding or changing a field in a data structure representing a user is a RELUP operation, since rolling forward or backward implies a data transformation to remain compatible
  • Upgrading the VM version is a RESTART because the C code of the VM itself can't change
  • Bumping a stateful dependency that does not provide live code upgrades forces a RESTART version bump

As with anything we do, the version bump may be wrong. But it at least carries a certain safety level in letting you know that a live code upgrade across a RESTART bump should absolutely not be attempted.

Engineers who get more familiar with live code upgrades will also learn some interesting lessons. For example, a RELUP change over a process that has tens of thousands of copies of itself may take a long, long time to run and be worse than a rolling upgrade. An interesting thing you can do then is turn RELUP changes (which would require code_change instructions) into basic code reloads by pattern matching the old structure and converting it on each call, turning it into a somewhat stateless roll-forward affair.
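
To sketch that trick (a hypothetical module, not taken from the real project): the worker accepts the old state shape and rewrites it as it goes, so a plain module reload is enough:

-module(conn_worker_sketch).
-export([loop/1]).

%% Older code carried the state as a tuple; the new code uses a map.
%% Instead of a code_change/3 migration (RELUP), the first clause
%% tolerates the old shape and upgrades it in place, so a plain
%% module reload (RELOAD) does the job.
loop({old_state, Socket}) ->
    ?MODULE:loop(#{socket => Socket, pings => 0});
loop(State = #{socket := Socket, pings := N}) ->
    receive
        {tcp, Socket, _Data} ->
            gen_tcp:send(Socket, io_lib:format("pings: ~p~n", [N + 1])),
            ?MODULE:loop(State#{pings := N + 1});
        {tcp_closed, Socket} ->
            ok
    end.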

That's essentially converting operational burdens into dirtier code, but this sort of thing is something you do all the time with database migrations (create a new table, double-write, write only to the new one, delete the old table) and that can now be done with running code.

For a new development workflow that tries to orient itself towards live code upgrades, Smoothver is likely to carry a lot more useful information than Semver would (and maybe could be nice for database migrations as well, since they share concerns).

Publishing the Artifacts

I needed to introduce the versioning mechanism because the overall publication workflow will obey it. If you're generating a new release version that requires a RESTART bump, then don't bother generating live code upgrade instructions. If you're generating anything else, do include them.

I've decided to center my workflow around git tags. If you tag your release v1.2.3, then v1.2.4 or v1.4.1 all do a live code upgrade, but v2.0.0 won't, regardless of which branch they go to. The CI script is not too complicated, and is in three parts:

  1. Fetch the currently deployed manifest, and see if the newly tagged version requires a live code upgrade ("relup") or not
  2. Build the release tarball with the relup instructions if needed. Here I rely purely on Luis's plugin to handle all the instructions.
  3. Put the files on S3

That's really all there is to it. I'm assuming that if you wanted to have more environments, you could set up GitOps by having more tags (staging-v1.2.3, prod-v1.2.5) and more S3 buckets or paths. But everything is assumed to be driven by these build artifacts.

A small caveat here is that it's technically possible to generate upgrade instructions (appup files) that map from many to many versions: how to update to 1.2.0 from 1.0.0, 1.0.1, 1.0.2, and so on. Since I'm assuming a linear deployment flow here, I'm just ignoring that and always generating pairs from "whatever is in prod" to "whatever has been tagged". There are obvious race conditions in doing this, where two releases generated in parallel can specify upgrade rules from a shared release, but could be applied and rolled out in a distinct order.

Using the Manifest and Smoothver

Relying on the manifest and versions requires a few extra lines in the sidecar's update loop. It looks at the version, and if it's a RESTART bump or an older release, it ignores it:

# Get the running version
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    IS_UPGRADE=$(echo "$TAG $CURRENT" | awk -v FS='[. ]' '($1==$4 && $2>$5) || ($1==$4 && $2>=$5 && $3>$6) {print 1; exit} {print 0}')
    if [[ $IS_UPGRADE -eq 1 ]]; then
        wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
        ${RELDIR}/bin/${RELEASE} unpack ${TAG}
        ${RELDIR}/bin/${RELEASE} install ${TAG}
        ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
    fi
fi

There's some ugly awk logic in there, but I wanted to avoid hosting my own images. The script could be made a lot more solid by checking whether we're bumping from the proper version to the next one, and in this it shares a race condition similar to the generation step's.

On the other hand, the install step looks at the specified upgrade instructions and will refuse to run (resulting in a sidecar crash) if a bad release is submitted.

I figure that alerting on crashed sidecars could be used to drive further automation to delete and replace the pods, resulting in a rolling upgrade. Alternatively, the error itself could be used to trigger a failure in liveness and/or readiness probes, and force-automate that replacement. This is left as an exercise to the reader, I guess. The beauty of writing prototypes is that you can just decide this to be out of scope and move on, and let someone who's paid to operationalize that stuff figure out the rest.

Oh and if you just change the Erlang VM's version? That changes the Kubernetes YAML file, and if you're using anything like Helm or some CD system (like ArgoCD), these will take care of running the rolling upgrade for you. Similarly, annotating the chart with a label of some sort indicating the RESTART version will accomplish the same purpose.

You may rightfully ask whether it is a good idea to bring mutability of this sort to a containerized world. I think that using S3 artifacts isn't inherently less safe than a container registry, dynamic feature flags, or relying on encryption services or DNS records for functional application state. I'll leave it at that.

Adding CI validation

Versioning things is really annoying. Each OTP app and library, and each release, needs to be versioned properly. And sometimes you change dependencies that don't have relup instructions available, you don't find out, and that breaks your live code upgrade.

What we can do is add a touch of automation to catch the most obvious failure situations and warn developers early about these issues. I've done so by adding a quick relup CI step to all pull requests, by using a version check script that encodes most of that logic.

The other thing I started experimenting with was setting up some sort of test suite for live code upgrades:

# This here step is a working sample, but if you were to run a more
# complex app with external dependencies, you'd also have to do a
# more intricate multi-service setup here, e.g.:
# https://github.com/actions/example-services
- name: Run relup application
  working-directory: erlang
  run: |
    mkdir relupci
    tar -xvf "${{ env.OLD_TAR }}" -C relupci
    # use a simple "run the task in the background" setup
    relupci/bin/dandelion daemon
    TAG=$(echo "${{ env.NEW_TAR }}" | sed -nr 's/^.*([0-9]+\.[0-9]+\.[0-9]+)\.tar\.gz$/\1/p')
    cp "${{ env.NEW_TAR }}" relupci/releases/
    relupci/bin/dandelion unpack ${TAG}
    relupci/bin/dandelion install ${TAG}
    relupci/bin/dandelion upgrade ${TAG}
    relupci/bin/dandelion versions

The one thing that would make this one a lot cooler is to write a small extra app or release that runs in the background while the upgrade procedure goes on. It could do things like:

  • generate constant load
  • run smoke tests on critical workflows
  • maintain live connections to show lack of failures
  • report on its state on demand

By starting that process before the live upgrade and questioning it after, we could ensure that the whole process went smoothly. Additional steps could also look at logs to know if things were fine.

The advantage of adding CI here is that each pull request can take measures to ensure it is safely upgradable live before being merged to main, even if none of them are deployed right away. By setting that gate in place, engineers are getting a much shorter feedback loop asking them to think about live deployments.

Running through a live code upgrade

I've run through a few iterations to test and check everything. I set up microk8s on my laptop, ran kubectl apply -f dandelion.yaml, and saw that the pod was up and running fine:

$ kubectl -n dandelion get pods
NAME                                    READY   STATUS    RESTARTS   AGE
dandelion-deployment-648db88f44-49jl8   2/2     Running   0          25H

It is possible to exec into one of the containers, hop onto a REPL, and see what is going on:

$ kubectl -n dandelion exec -i -t dandelion-deployment-648db88f44-49jl8 -c dandelion-sidecar -- /bin/bash
root@dandelion-deployment-648db88f44-49jl8:/# /release/bin/dandelion remote_console
Erlang/OTP 25 [erts-13.0.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit]

Eshell V13.0.2  (abort with ^G)
(dandelion@localhost)1> release_handler:which_releases().
[{"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  permanent},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]

This shows that the container had been running for a day, and already had two releases—it first booted on version 0.1.4 and had already gone through a bump to 0.1.5. I opened a small pull request changing the display (and messed up versioning, which CI caught!), merged it, tagged it v0.1.6, and started listening to my Kubernetes cluster:

$ nc 192.168.64.2 8080

   @
 \ |
__\!/__

...

   @
 \ |
__\!/__

vsn?
vsn: 0.1.5

   @
 \ |
__\!/__

      *
   @
 \ |
__\!/__

vsn?
vsn: 0.1.6
      *
   @
 \ |
__\!/__

...

This shows me interrogating the app (vsn?) and getting the version back, and without dropping the connection, having a little pappus floating in the air.

My REPL session was still live in another terminal:

(dandelion@localhost)2> release_handler:which_releases().
[{"dandelion","0.1.6",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.6","sasl-4.2"],
  permanent},
 {"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  old},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]

showing that the old releases are still around as well. And here we have it, an actual zero-downtime deploy in a Kubernetes container.

Conclusion

Joe's favorite program could fit on a business card. Mine is maddening. But I think this is because Joe didn't care for the toolchains people were building and just wanted to do his thing. My version reflects the infrastructure we have put in place, and the processes we want and need for a team.

Rather than judging the scaffolding, I'd invite you to think about what would change when you start centering your workflow around a living system.

Those of you who have worked with bigger applications that have a central database or shared schemas around network protocols (or protobuf files or whatever) know that you approach your work differently when you have to consider how it's going to be rolled out. It impacts your tooling, how you review changes, how you write them, and ultimately just changes how you reason about your code and changes.

In many ways it's a more cumbersome way to deploy and develop code, but you can also think of the other things that change: what if instead of having configuration management systems, you could hard-code your config in constants that just get rolled out live in less than a minute—and all your configs were tested as well as your code? Since all the release upgrades implicitly contain a release downgrade instruction set, just how much faster could you rollback (or automate rolling back) a bad deployment? Would you be less afraid of changing network-level schema definitions if you made a habit of changing them within your app? How would your workflow change if deploying took half-a-second and caused absolutely no churn nor disruption to your cluster resources most of the time?

Whatever structure we have in place guides a lot of invisible emergent behaviour, both in code and in how we adjust ourselves to the structure. Much of what we do is a tacit response to our environment. There's a lot of power in experimenting with alternative structures, and seeing what pops up at the other end. A weed is only considered as such in some contexts. This is a freak show of a deployment mechanism, but it sort of works, and maybe it's time to appreciate the dandelions for what they can offer.


Erlang/OTP 25.0 Release

Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities.

For details about new features, bugfixes and potential incompatibilities see the Erlang 25.0 README or the Erlang/OTP 25.0 downloads page.

Many thanks to all contributors!

Erlang/OTP 25.0 Highlights

stdlib

  • New function filelib:ensure_path/1 will ensure that all directories for the given path exist (see the shell sketch below)
  • New functions groups_from_list/2 and groups_from_list/3 in the maps module
  • New functions uniq/1 and uniq/2 in the lists module
  • New PRNG added to the rand module, for fast pseudo-random numbers.
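
For instance, a quick shell sketch of three of these on an OTP 25 node:

1> filelib:ensure_path("tmp/deeply/nested/dir").
ok
2> maps:groups_from_list(fun(X) -> X rem 2 end, [1,2,3,4,5]).
#{0 => [2,4],1 => [1,3,5]}
3> lists:uniq([a,b,a,c,b]).
[a,b,c]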

compiler, kernel, stdlib, syntax_tools

  • Added support for selectable features as described in EEP-60. Features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used to erl for enabling/disabling features allowed at runtime. The new maybe expression EEP-49 is fully supported as the feature maybe_expr.

erts & JIT

  • The JIT now works for 64-bit ARM processors.
  • The JIT now does type-based optimizations based on type information in the BEAM files.
  • Improved the JIT’s support for external tools like perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

erts, stdlib, kernel

  • Users can now configure ETS tables with the {write_concurrency, auto} option. This option forces tables to automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active.
    Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks.
  • To enable more optimizations, BEAM files compiled with OTP 21 and earlier cannot be loaded in OTP 25.
  • The signal queue of a process with the process flag message_queue_data=off_heap has been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.
  • The Erlang installation directory is now relocatable on the file system given that the paths in the installation’s RELEASES file are paths that are relative to the installations root directory.
  • A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
  • Introduction of quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.
  • The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
  • global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.
    It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.
  • The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.
    The new callback adds the possibility to limit and change many more things than just the state.
  • The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.

compiler

  • The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used.
    To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate and then it will also be possible to use from inside the module being compiled.
  • When a record matching or record update fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.
  • Add compile attribute -nifs() to empower compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.
  • Improved and more detailed error messages when binary construction with the binary syntax fails. This applies both for error messages in the shell and for erl_error:format_exception/3,4.
  • Change format of feature options and directives for better consistency. Options to erlc and the -compile(..) directive now have the format {feature, feature-name, enable | disable}. The -feature(..) directive now has the format -feature(feature-name, enable | disable).

crypto

  • Add crypto:hash_equals/2 which is a constant-time comparison of hash values.

ssl

  • Introducing a new (still experimental) option {certs_keys,[cert_key_conf()]}. With this, a list of certificates with their associated keys may be used to authenticate the client or the server. The certificate key pair that is considered best and matches negotiated parameters for the connection will be selected.

public_key

  • Functions for retrieving OS provided CA-certs added.

dialyzer

  • Optimize operations in the erl_types module. Parallelize the Dialyzer pass remote.
  • Added the missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as overspecs and underspecs.
  • Dialyzer now better understands the types for min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.

Misc

  • A new DEVELOPMENT HOWTO guide has been added that describes how to build and test Erlang/OTP when fixing bugs or developing new functionality.
  • Testing has been added to the Github actions run for each opened PR so that more bugs are caught earlier when bug fixes and new features are proposed.

Download links for this and previous versions are found here

Errors are constructed, not discovered

2022/04/13

Over a year ago, I left the role of software engineering* behind to become a site reliability engineer* at Honeycomb.io. Since then, I've been writing a bunch of blog posts over there rather than over here, including the following:

There are also a couple of incident reviews, including one on a Kafka migration and another on a spate of scaling-related incidents.

Either way, I only have so many things to rant about to fill two blogs, so this place here has been a bit calmer. However, I recently gave a talk at IRConf (video).

I am reproducing this talk here because, well, I stand behind the content, but it would also not fit the work blog's format. I'm also taking this opportunity because I don't know how many talks I'll give in the next few years. I've decided to limit how much traveling I do for conferences due to environmental concerns—if you see me at a conference, it either overlapped with other opportunistic trips for work or vacations, or it was close enough for me to attend via less polluting means—and so I'd like to still post some of the interesting talks I have when I can.

The Talk

This talk is first of all about the idea that errors are not objective truths. Even if we look at objective facts with a lot of care, errors are arbitrary interpretations that we come up with, constructions that depend on the perspective we have. Think of them the same way constellations in the night sky are made up of real stars, but their shape and meaning are made up based on our point of view and what their shapes remind us of.

The other thing this talk will be about is ideas about what we can do once we accept this idea, to figure out the sort of changes and benefits we can get from our post-incident process when we adjust to it.

I tend to enjoy incidents a lot. Most of the things in this talk aren't original ideas, they're things I've read and learned from smarter, more experienced people, and that I've put back together after digesting them for a long time. In fact, I thought my title for this talk was clever, but as I found out by accident a few days ago, it's an almost pure paraphrasing of a quote in a book I've read over 3 years ago. So I can't properly give attribution for all these ideas because I don't know where they're from anymore, and I'm sorry about that.

A quote: 'Error' serves a number of functions for an organisation: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

This is a quote from "'Those found responsible have been sacked': some observations on the usefulness of error", which I'm using because even if errors are arbitrary constructions, they carry meaning, and they are useful to organizations. The paper defines four types I'll be paraphrasing:

  • Defense against entanglement: the concept of error or fault is a way for an organization to shield itself from the liabilities of an incident. By putting the fault on a given operator, we avoid having to question the organization's own mechanisms, and safely deflect it away.
  • Illusion of control: by focusing on individuals and creating procedures, we can preserve the idea that we can manage the world rather than having to admit that adverse events will happen again. This gives us a sort of comfort.
  • Distancing: this is generally about being able to maintain the idea that "this couldn't happen here", either because we are doing things differently or because we are different people with different practices. This also gives us a decent amount of comfort.
  • Failed investigation: finally, safety experts seem to see the concept of error, particularly human error, as a marker that the incident investigation has ended too early. There were more interesting things to dig into and that hasn't been done—because the human error itself is worth understanding as an event.

So generally, error is useful as a concept, but as an investigator it is particularly useful as a signal to tell you when things get interesting, not as an explanation on their own.

An iceberg above and below the waterline with labels pointing randomly. Above the waterline are operations (scaling, alerting, deploying), and below the waterline are code reviews, testing, values, experience, roadmap, training, behaviours rewarded and punished, etc.

And so this sort of makes me think about how a lot of incident reviews tend to go. We use the incident as an opportunity because the disruption is large enough to let us think about it all. But the natural framing that easily comes through is to lay blame on the operational area.

Here I don't mean blame as in "people fucked up" nearly as much as "where do we think the organisation needs to improve"—where do we think that, as a group, we need to improve as a result of this. The incident and the operations are the surface; they often need improvement, for sure, because operating is really tricky work done in special circumstances and it's worth constantly adjusting, but stopping there means missing out on a lot of possible content that could be useful.

People doing the operations are more or less thrown in a context where a lot of big decisions have been made already. Whatever was tested, who was hired, what the budgets are, and all these sorts of pressures are in a large part defined by the rest of the organization, they matter as well. They set the context in which operations take place.

So one question, then, is how we go from that surface-level vision and start figuring out what happens below that organisational waterline.

timelines are necessary: a black line with dots on it and the last end is an explosion

The first step of almost any incident investigation is to start with a timeline. Something that lets us go back from the incident or its resolution, and that we use as breadcrumbs to lead us towards ways to prevent this sort of thing from happening again. So we start at the bottom where things go boom, and walk backwards from there.

the same timeline, but with labels for failure, proximate cause, root cause, and steady state

The usual frameworks we're familiar with apply labels to these common patterns. We'll call something a failure or an error. The thing that happened right before it tends to be called a proximate cause, which is frequently used in insurance situations: it's the last event in the whole chain that could have prevented the failure from happening. Then we walk back. Either five times because we're doing the five whys, or until you land at a convenient place. If there is a mechanical or software component you don't like, you're likely to highlight its flaws there. If it's people or teams you don't trust as much, you may find human error there.

Even the concept of steady state is a shaky one. Large systems are always in some weird degraded state. In short, you find what you're looking for. The labels we use, the lens through which we look at the incident, influence the way we build our explanations.

the timeline is overlaid with light-grey branches that show paths sometimes leading to failures and sometimes not. Those are paths not taken, or that were visible to the practitioner

The overall system is not any of the specific lenses, though, it's a whole set of interactions. To get a fuller, richer picture, we have to account for what things looked like at the time, not just our hindsight-fuelled vision when looking behind. There are a lot of things happening concurrently, a lot of decisions made to avoid bad situations that never took place, and some that did.

Hindsight bias is something somewhat similar to outcome bias, which essentially says that because we know there was a failure, every decision we look at that has taken place before the incident will seem to us like it should obviously have appeared as risky and wrong. That's because we know the result, it affects our judgment. But when people were going down that path and deciding what to do, they were trying to do a good job; they were making the best calls they could to the extent of their abilities and the information available at the time.

We can't really avoid hindsight bias, but we can be aware of it. One tip there is to look at what was available at the time, and consider the signals that were available to people. If they made a decision that looks weird, then look for what made it look better than the alternatives back then.

Counterfactuals are another thing to avoid, and they're one of the trickiest ones to eliminate from incident reviews. They essentially are suppositions about things that have never happened and will never happen. Whenever we say "oh, if we had done this at this point in time instead of that, then the incident would have been prevented", we're talking about a fictional universe that never happened, and that's not productive.

I find it useful to always cast these comments into the future: "next time this happens, we should try that to prevent an issue." This orients the discussion towards more realistic means: how can we make this option more likely? the bad one less interesting? In many cases, a suggestion will even become useless: by changing something else in the system, a given scenario may no longer be a concern for the future, or it may highlight how a possible fix would in fact create more confusion.

Finally, normative judgments. Those are often close to counterfactuals, but you can spot them because they tend to be about what people should or shouldn't have done, often around procedures or questions of competence. "The engineer should have checked the output more carefully, and they shouldn't have run the command without checking with someone else, as stated in the procedure." Well, they did it because it arguably looked reasonable at the time!

The risk with a normative judgment is that it assumes that the established procedure is correct and realistically applicable to the situation at hand. It assumes that deviations and adjustments made by the responder are bound to fail, even as we conveniently ignore all the times they work. We can't properly correct procedures if we think they're already perfect and it's wrong not to obey them, and we can't improve tooling if we believe the problem is always the person holding it wrong.

Another timeline representation of an incident, flatly linear. It has elements like alert, logging on, taking a look. Then a big gap. Then the issue is found, and the fix is written, and the issue closed

A key factor is to understand that in high-pressure incident response, failures and successes come from the same mechanisms. We're often tired, distracted, or possibly thrown in there without adequate preparation. What we do to try and make things go right—and often succeed—is also in play when things go wrong.

People look for signals, and have a background and training that influences the tough calls that usually will be shared across situations. We tend to want things to go right. The outcome tends to define whether the decision was a good one or not, but the decision-making mechanism is shared both for decisions that go well and those that do not. And so we need to look at how these decisions are made with the best of intentions to have any chance of improving how events unfold the next time.

This leads to the idea that you want to look at what's not visible, because that's where the real work shows.

Same timeline, but the big gap is highlighted and says 'this is where we repair our understanding of the world'

I say this is "real work" because we come into a task with an understanding of things, a sort of mental model. That mental model is the rolled-up experience we have; it lets us frame all the events we encounter, and it is the thing we use to predict the consequences of our decisions.

When we are in an incident, there's almost always a surprise in there, which means that the world and our mental model are clashing. This mismatch between our understanding of the world and the real world was already there. That gap between both needs to be closed, and the big holes in an incident's timelines tend to be one of the places where this takes place.

Whenever someone reports "nothing relevant happens here", these are generally the places where active hypothesis generation periods happen, where a lot of the repair and gap bridging is taking place.

This is where the incident can become a very interesting window into the whole organizational iceberg below the water line.

An iceberg above and below the waterline with labels pointing randomly. Above the waterline are operations (scaling, alerting, deploying), and below the waterline are code reviews, testing, values, experience, roadmap, training, behaviours rewarded and punished, etc.

So looking back at the iceberg, looking at how decisions are made in the moment lets you glimpse at the values below the waterline that are in play. What are people looking at, how are they making their decisions. What's their perspective. These are impacted by everything else that happens before.

If you see concurrent outages or multiple systems impacted, digging into which one gets resolved first and why that is is likely to give you insights about what responders feel confident about, the things they believe are more important to the organization and users. They can reflect values and priorities.

If you look at who they ask for help and where they look for information (or avoid looking for it), this will let you know about various dynamics, social and otherwise, that might be going on in your organization. This can be because some people are central points of knowledge, others are jerks or seen as more or less competent, or it can be about what people believe the state of documentation is at that point in time.

And this is why changing how we look at and construct errors matters. If we take the straightforward causal approach, we'll tend to only skim the surface. Looking at how people do their jobs and how they make decisions is an effective way to go below that waterline, and it has a much broader impact than staying above water.

A list of questions such as 'what was your first guess?', 'what made you look at this dashboard?', or 'how were you feeling at the time?'

To take a proper dive, it helps to ask the proper type of questions. As a facilitator, your job is to listen to what people tell you, but there are ways to prompt for more useful information. The Etsy debriefing facilitation guide is a great source, and so is Jeli's Howie guide. The slide contains some of the questions I like to ask most.

There's one story I recall from a previous job where a team had written an incident report on an outage with some service X. The report had that sort of 30-minute gap in it, and they were asking for feedback on it. I instantly asked "so what was going on during this time?", only for someone on the team to answer "oh, we were looking for the dashboard of service Y". I asked why they had been looking at the dashboard of another service, and he said that the service's own dashboard wasn't trustworthy, and that this one gave a better picture of the health of the service through its effects. And just like that we opened new paths for improvements that were so normal they had become invisible.

Another one also came from a previous job where an engineer kept accidentally deleting production databases and triggering a whole disaster recovery response. They were initially trying to delete a staging database that was dynamically generated for test cases, but kept fat-fingering the removal of production instances in the AWS console. Other engineers were getting mad and felt that person was being incompetent, and were planning to remove all of their AWS console permissions because there also existed an admin tool that did the same thing safely by segmenting environments.

I ended up asking the engineer if there was anything that made them choose the AWS console over the admin tool given the difference in safety, and they said, quite simply, that the AWS console has autocomplete and they never remembered the exact table name, so it was just much faster to delete the table there than in the admin tool. This was an interesting one because instead of blaming the engineer for being incompetent, it opened the door to questioning the gap in tooling rather than adding more blockers and procedures.

In both of these stories, a focus on how people were making their decisions and their direct work experience managed to highlight alternative views that wouldn't have come up otherwise. They can generate new, different insights and action items.

A view of a sequence diagram used for an incident review

And this is the sort of map that, when I have time for it, I try to generate at Honeycomb. It's non-linear, and the main objective is to help show different patterns in the response. Rather than building a map by categorizing events within a structure, the idea is to lay the events out and see what sort of structure pops up. And then we can walk through the timeline and ask what we were thinking, feeling, or seeing.

The objective is to highlight challenging bits and look at the way we work in a new light. Are there things we trust, distrust? Procedures that don't work well, bits where we feel lost? Focusing on these can improve response in the future.

This idea of focusing on generating rather than categorizing is intended to take an approach that is closer to qualitative research than quantitative research.

A comparison of attributes of qualitative vs. quantitative research

The way we structure our reviews will have a large impact on how we construct errors. I tend to favour a qualitative approach to a quantitative one.

A quantitative approach will often look at ways to aggregate data and create ways to compare one incident to the next. It will measure things such as the Mean Time To Recovery (MTTR), the impact, and the severity, and will look to assign costs and various classifications. This approach is good for highlighting trends and patterns across events, but as far as I can tell, it won't necessarily provide a solid path to practical improvements for any of the issues found.

The qualitative approach, by comparison, aims to do a deep dive to provide a more complete understanding. It tends to be more observational and generative. Instead of cutting up the incident and classifying its various parts, we look at what was challenging, what people felt was significant during the incident, and all sorts of messy details. These will highlight tricky dynamics, both in high-pace outages and in wider organizational practices, and they are generally behind the insights that help change things effectively.

A comparison of insights obtained with both approaches (as described in the text)

To put this difference in context, I have an example from a prior job, where one of my first mandates was to try and help with their reliability story. We went over 30 or so incident reports that had been written over the last year, and a pattern that quickly came up was how many reports mentioned "lack of tests" (or lack of good tests) as causes, and had "adding tests" in their action items.

Looking at the overall list, our initial diagnosis was that testing practices were challenging. We thought of improving the ergonomics around tests (making them faster) and of providing training in better ways to test. But then we had another incident where the review reported tests as an issue, so I decided to jump in.

I reached out to the engineers in question and asked about what made them feel like they had enough tests. I said that we often write tests up until the point where we feel they're not adding much anymore, and that I was wondering what they were looking at, what made them feel like they had reached the point where they had enough tests. They just told me directly that they knew they didn't have enough tests. In fact, they knew the code was buggy. But they felt in general that it was safer to be on time with a broken project than late with a working one. They were afraid that being late would get them in trouble and have someone yell at them for not doing a good job.

And so that revealed a much larger pattern within the organization and its culture. When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn't ready. The engineers on that team felt that while this is what they were being told, in practice they'd still get in trouble.

There's no amount of test training that would fix this sort of issue. The engineers knew they didn't have enough tests and they were making that tradeoff willingly.

A smooth line with a zoomed-in area that shows it all bumpy and cracked.

So to conclude on this, the focus should be on understanding the mess:

  • Go for a deeper understanding of specific incidents where you feel something intriguing or interesting happened. Aggregates of all incidents tend to hide the messy details, so if you have a bunch of reviews to do, it's probably better to be thorough on one interesting incident than shallow on many of them.
  • Mental models are how problem solving tends to be done; we understand and predict things based on them. Incidents are amazing opportunities to compare, contrast, and correct mental models to make them more accurate or easier to contextualize.
  • Seek an understanding of how people do their work. There is always a gap between work as we imagine it to be and work as it actually is. The narrower that gap, the more effective our changes are going to be, so focusing on understanding all the nitty-gritty details of work and its pressures is going to prove more useful than having super solid numbers.
  • Psychological safety is always essential; the thing that lets us narrow the gap between work-as-done and work-as-imagined is whether people feel safe reporting and describing what they go through. Without psychological safety and good blame awareness, you're going to have a hard time getting good results.

Overall, the idea is that looking for understanding more than causes opens up a lot of doors and makes incidents more valuable.


* I can't legally call myself a software engineer, and technically I can't be a site reliability engineer either, because Quebec considers engineering to be a protected discipline. I, however, do not really get to tell American employers what job titles they should give to people, so I get stuck having titles I can't legally advertise but for which no real non-protected forms exist to communicate. So anywhere you see me referred to as any sort of "engineer", that's not an official title I would choose. It'd be nice to know what the non-engineer-titled equivalent of SRE ought to be.


OTP 25 Release candidate 3

OTP 25-rc3

Erlang/OTP 25-rc3 is the third and final release candidate before the OTP 25.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues or by posting to Erlangforums or the mailing list erlang-questions@erlang.org.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-13.0-rc3/doc/. You can also install the latest release using kerl like this: kerl build 25.0-rc3 25.0-rc3.

Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Below are some highlights of the release:

Highlights rc3

compiler

Change format of feature options and directives for better consistency. Options to erlc and the -compile(..) directive now have the format {feature, feature-name, enable | disable}. The -feature(..) directive now has the format -feature(feature-name, enable | disable).
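
To make the new directive format concrete, here is a minimal, hedged sketch of a module enabling a feature with the rc3 syntax described above. The feature name maybe_expr comes from the stdlib/compiler highlights below; the module itself is just an illustration.

%% Sketch only: requires OTP 25 and the experimental maybe_expr feature.
-module(feature_demo).
-feature(maybe_expr, enable).   %% rc3 directive format described above
-export([safe_div/2]).

%% Uses the maybe expression (EEP-49) that the feature unlocks.
safe_div(_A, 0) -> {error, division_by_zero};
safe_div(A, B) ->
    maybe
        true ?= is_number(A),
        true ?= is_number(B),
        {ok, A / B}
    else
        false -> {error, badarg}
    end.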

ssl

Introducing a new (still experimental) option {certs_keys, [cert_key_conf()]}. With this option, a list of certificates with their associated keys may be used to authenticate the client or the server. The certificate-key pair that is considered best and matches the negotiated parameters for the connection will be selected.
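
As a hedged sketch of how this might look (the option is still experimental, and the PEM file names below are placeholders), a server could offer both an RSA and an ECDSA certificate chain and let the handshake pick whichever matches the negotiated parameters:

%% Assumption: cert_key_conf() maps with certfile/keyfile entries.
{ok, ListenSocket} =
    ssl:listen(8443,
               [{certs_keys, [#{certfile => "rsa_cert.pem",   keyfile => "rsa_key.pem"},
                              #{certfile => "ecdsa_cert.pem", keyfile => "ecdsa_key.pem"}]},
                {versions, ['tlsv1.3']}]).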

Highlights rc2

stdlib

  • New function filelib:ensure_path/1 will ensure that all directories for the given path exist (see the example after this list)
  • New functions groups_from_list/2 and groups_from_list/3 in the maps module
  • New functions uniq/1 and uniq/2 in the lists module
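
A small shell example of the three additions above (the directory name is just a placeholder):

%% filelib:ensure_path/1 creates any missing directories in the path.
ok = filelib:ensure_path("tmp/reports/2022"),

%% maps:groups_from_list/2 groups list elements by the key the fun returns.
#{0 := [2, 4], 1 := [1, 3, 5]} =
    maps:groups_from_list(fun(X) -> X rem 2 end, [1, 2, 3, 4, 5]),

%% lists:uniq/1 removes duplicates while keeping first-occurrence order.
[1, 2, 3] = lists:uniq([1, 1, 2, 2, 3]).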

compiler, kernel, stdlib, syntax_tools

  • Added support for selectable features as described in EEP-60. Features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used to erl for enabling/disabling features allowed at runtime. The new maybe expression EEP-49 is fully supported as the feature maybe_expr.

Highlights rc1

erts & jit

  • The JIT now works for 64-bit ARM processors.
  • The JIT now does type-based optimizations based on type information in the BEAM files.
  • Improved the JIT’s support for external tools like perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

erts, stdlib, kernel

  • Users can now configure ETS tables with the {write_concurrency, auto} option. This option makes tables automatically adjust the number of locks used at run time depending on how much concurrency is detected (see the sketch after this list). The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active.

    Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks.

  • To enable more optimizations, BEAM files compiled with OTP 21 and earlier cannot be loaded in OTP 25.
  • The signal queue of a process with the process flag message_queue_data=off_heap has been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.
  • The Erlang installation directory is now relocatable on the file system, given that the paths in the installation’s RELEASES file are relative to the installation’s root directory.
  • A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
  • Introduction of quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.
  • The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
  • global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.

    It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.

  • The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.

    The new callback adds the possibility to limit and change many more things than just the state.

  • The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.
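
To make a few of the items above concrete, here is a small, hedged sketch (the table name and values are arbitrary, and the expected strings assume the behavior described above):

%% ETS table whose number of locks is tuned automatically at run time.
Tab = ets:new(sessions, [set, public, {write_concurrency, auto}]),
true = ets:insert(Tab, {user_1, active}),

%% Shortest correctly rounded text representation of a float.
"0.1" = erlang:float_to_list(0.1, [short]),

%% uri_string:quote/1 replaces the deprecated http_uri:encode/1.
"name%20with%20spaces" = uri_string:quote("name with spaces").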

Compiler

  • The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used.

    To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate and then it will also be possible to use from inside the module being compiled.

  • When a record matching or record update fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.
  • Add the compile attribute -nifs() to inform the compiler and loader about which functions may be overridden as NIFs by erlang:load_nif/2 (a sketch follows this list).
  • Improved and more detailed error messages when binary construction with the binary syntax fails. This applies both for error messages in the shell and for erl_error:format_exception/3,4.
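
A hedged sketch of the -nifs() attribute in the usual on_load NIF pattern; the module name and library path are placeholders:

-module(my_math).
-export([fast_add/2]).
%% Declare which functions are allowed to be overridden as NIFs.
-nifs([fast_add/2]).
-on_load(init/0).

init() ->
    %% "./my_math_nif" is a placeholder path to the compiled NIF library.
    erlang:load_nif("./my_math_nif", 0).

%% Erlang fallback that is replaced if the NIF loads successfully.
fast_add(_A, _B) ->
    erlang:nif_error(nif_library_not_loaded).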

Crypto

  • Add crypto:hash_equals/2, which is a constant-time comparison of hash values (example below).
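
For example (a sketch; Key, Msg, and ReceivedMac stand for values you already have), comparing a computed HMAC against a received one without leaking timing information:

%% Both arguments must be binaries of equal length.
verify_mac(Key, Msg, ReceivedMac) ->
    OurMac = crypto:mac(hmac, sha256, Key, Msg),
    crypto:hash_equals(OurMac, ReceivedMac).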

Dialyzer

  • Optimize operations in the erl_types module. Parallelize the Dialyzer pass remote.
  • Added the missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as overspecs and underspecs.
  • Dialyzer now better understands the types for min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning (see the sketch after this list).
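
For instance (a minimal, hypothetical helper), a wrapper around erlang:raise/3 would now typically carry a no_return() spec so its callers don't trigger the new warnings:

%% Re-raise an exception with its original class and stacktrace.
-spec reraise(error | exit | throw, term(), [term()]) -> no_return().
reraise(Class, Reason, Stacktrace) ->
    erlang:raise(Class, Reason, Stacktrace).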

Misc

  • A new DEVELOPMENT HOWTO guide has been added that describes how to build and test Erlang/OTP when fixing bugs or developing new functionality.
  • Testing has been added to the Github actions run for each opened PR so that more bugs are caught earlier when bug fixes and new features are proposed.

For more details about new features and potential incompatibilities see


Function/Variable Ambiguity in Elixir

Variable syntax is one of the big differences between Erlang and Elixir that I encountered when learning Elixir. Instead of each variable name having to start with an uppercase letter, it must start with a lowercase letter. This change in syntax seems like an improvement - after all, most mainstream programming languages require variables to start with a lowercase letter, and lowercase is generally easier to type. However, a deeper look at this syntax choice reveals some significant downsides that I want to present here.

The Problem

The example I’m using to illustrate this problem is from a blog post on variable shadowing in Elixir by Michael Stalker.

defmodule Shadowing do
  x = 5
  def x, do: x
  def x(x = 0), do: x
  def x(x), do: x(x - 1)
end

Without running the code, tell me what the return values of these three function calls are:

Shadowing.x()

Shadowing.x(0)

Shadowing.x(2)

No, really. Think about the code for a minute.

Now…are you positive your answers are right?

This code snippet is confusing because the variable names and function names are indistinguishable from each other. This is an ambiguity in scope and also an ambiguity in identifier type. It's not clear whether the token x is a function name (an atom) or a variable (identified by the same sequence of characters). Both are identifiers, but unlike in Erlang, function identifiers and variable identifiers look the same. Despite this, the compiler doesn't get confused and handles this code according to Elixir's scoping rules.

Erlang Is Better

I translated the Elixir code above to Erlang. The functions in this Erlang module behave exactly the same as the functions in the Elixir module above.

-module(shadowing).

-export([x/0, x/1]).

-define(X, 5).

x() -> x().
x(X) when X == 0 -> X;
x(X) -> x(X - 1).

With Erlang all the ambiguity is gone. We now have functions and variables that cannot be confused. All variables start with an uppercase letter, and all function names start with a lowercase letter or are wrapped in single quotes. This makes it impossible to confuse the two. Granted, this is not an apples-to-apples comparison because Erlang doesn't have a module scope for variables, so I used a macro for the module-level variable. But we still have a function and a variable that can no longer be confused.

Conclusion

Despite its rough edges, Erlang syntax is unambiguous. This is a key advantage Erlang has over Elixir when it comes to syntax. Variables, functions, and all other data types are easily distinguishable. Keywords can be confused with other atoms, but this is seldom a problem in practice. The list of keywords is short and easy to memorize, and syntax highlighters display them in a specific color, making memorization unnecessary most of the time.


OTP 25 Release candidate 2

OTP 25-rc2

Erlang/OTP 25-rc2 is the second release candidate of three before the OTP 25.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues or by posting to Erlangforums or the mailing list erlang-questions@erlang.org.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-13.0-rc2/doc/. You can also install the latest release using kerl like this: kerl build 25.0-rc2 25.0-rc2.

Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Below are some highlights of the release:

Highlights rc2

stdlib

  • New function filelib:ensure_path/1 will ensure that all directories for the given path exist
  • New functions groups_from_list/2 and groups_from_list/3 in the maps module
  • New functions uniq/1 and uniq/2 in the lists module

compiler, kernel, stdlib, syntax_tools

  • Added support for selectable features as described in EEP-60. Features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used to erl for enabling/disabling features allowed at runtime. The new maybe expression EEP-49 is fully supported as the feature maybe_expr.

Highlights rc1

erts & jit

  • The JIT now works for 64-bit ARM processors.
  • The JIT now does type-based optimizations based on type information in the BEAM files.
  • Improved the JIT’s support for external tools like perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

erts, stdlib, kernel

  • Users can now configure ETS tables with the {write_concurrency, auto} option. This option forces tables to automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active.

    Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks.

  • To enable more optimizations, BEAM files compiled with OTP 21 and earlier cannot be loaded in OTP 25.
  • The signal queue of a process with the process flag message_queue_data=off_heap has been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.
  • The Erlang installation directory is now relocatable on the file system, given that the paths in the installation’s RELEASES file are relative to the installation’s root directory.
  • A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
  • Introduction of quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.
  • The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
  • global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.

    It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.

  • The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.

    The new callback adds the possibility to limit and change many more things than just the state.

  • The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.

Compiler

  • The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used.

    To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate and then it will also be possible to use from inside the module being compiled.

  • When a record matching or record update fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.
  • Add compile attribute -nifs() to empower compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.
  • Improved and more detailed error messages when binary construction with the binary syntax fails. This applies both for error messages in the shell and for erl_error:format_exception/3,4.

Crypto

  • Add crypto:hash_equals/2, which is a constant-time comparison of hash values.

Dialyzer

  • Optimize operations in the erl_types module. Parallelize the Dialyzer pass remote.
  • Added the missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as overspecs and underspecs.
  • Dialyzer now better understands the types for min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.

Misc

  • A new DEVELOPMENT HOWTO guide has been added that describes how to build and test Erlang/OTP when fixing bugs or developing new functionality.
  • Testing has been added to the Github actions run for each opened PR so that more bugs are caught earlier when bug fixes and new features are proposed.

For more details about new features and potential incompatibilities see


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.