I recently had to fix an Elixir service that was slow to start. I was able to pinpoint the issue with only a few commands and I want to share a couple of the things I learned.
In Elixir all dependencies are “applications”. The term “application” means something different than it does outside of Elixir. In Elixir an “application” is a set of modules and behaviors. Some of these applications define their own supervision trees and must be started by Application.start/2
before they can be used. When you start your Elixir service, either via Mix or a generated Elixir release, the dependencies you specified in your mix.exs
file will be started before your own code is started. If an application listed as a dependency is slow to start, your application must wait until that dependency is running before it can be started.
While the behavior is simple it is recursive. Each application has its own set of dependencies that must be running before that application can be started, and some of those dependencies have dependencies of their own that must be running before they can start. This results in a dependency tree structure. To illustrate this with a little ASCII:
- your_app
  - dependency_1
    - hidden_dependency_1
    - hidden_dependency_2
  - dependency_2
    - hidden_dependency_3
For this application, the Erlang VM would likely start these applications in this order:
hidden_dependency_3
dependency_2
hidden_dependency_2
hidden_dependency_1
dependency_1
your_app
The application I had to fix had a lot of dependencies. Profiling each application would be tedious and time-consuming, and I had a hunch there was probably a single dependency that was the problem. Turns out it’s pretty easy to write a little code that times the start up of each application.
Start an IEx shell with the --no-start
flag so that the application is available but not yet loaded or started:
iex -S mix run --no-start
Then load this code into the shell:
defmodule StartupBenchmark do
  def run(application) do
    complete_deps = deps_list(application) # (1)

    dep_start_times =
      Enum.map(complete_deps, fn(app) -> # (2)
        case :timer.tc(fn() -> Application.start(app) end) do
          {time, :ok} -> {time, app}
          # Some dependencies like :kernel may have already been started, we can ignore them
          {time, {:error, {:already_started, _}}} -> {time, app}
          # Raise an exception if we get a non-successful return value
          {_time, error} -> raise "failed to start #{inspect(app)}: #{inspect(error)}"
        end
      end)

    dep_start_times
    |> Enum.sort() # (3)
    |> Enum.reverse()
  end

  defp deps_list(app) do
    # Get all dependencies for the app
    deps = Application.spec(app, :applications)
    # Recursively call to get all sub-dependencies
    complete_deps = Enum.map(deps, fn(dep) -> deps_list(dep) end)
    # Build a complete list of sub dependencies, with the top level application
    # requiring them listed last, also remove any duplicates
    [complete_deps, [app]]
    |> List.flatten()
    |> Enum.uniq()
  end
end
To highlight the important pieces from this module:
1. Recursively get all applications that must be started, in the order they need to be started in.
2. Start each application in order, timing each one.
3. Sort applications by start time so the slowest application is the first item in the list.
With this code finding applications that are slow to start is easy:
> StartupBenchmark.run(:your_app)
[
{6651969, :prometheus_ecto},
{19621, :plug_cowboy},
{14336, :postgrex},
{13598, :ecto_sql},
{5123, :yaml_elixir},
{3871, :phoenix_live_dashboard},
{1159, :phoenix_ecto},
{123, :prometheus_plugs},
{64, :ex_json_logger},
{56, :prometheus_phoenix},
{56, :ex_ops},
{36, :kernel},
...
]
These times are in microseconds so in this case prometheus_ecto
is taking 6.6 seconds to start. All other applications are taking less than 20 milliseconds to start and many of them are taking less than 1 millisecond to start. prometheus_ecto
is the culprit here.
With the code above I was able to identify prometheus_ecto as the problem. With this information I was then able to use eFlambe and a few other tools to figure out why prometheus_ecto was so slow and quickly fix the issue.
I hope the snippet of code above will be helpful to some of you. If you like reading my blog posts please subscribe to my newsletter. I send emails out once a month with my latest posts.
Joe Armstrong wrote a blog post titled My favorite Erlang Program, which showed a very simple universal server written in Erlang:
universal_server() ->
    receive
        {become, F} ->
            F()
    end.
You could then write a small program that could fill the role of this function F:
factorial_server() ->
    receive
        {From, N} ->
            From ! factorial(N),
            factorial_server()
    end.

factorial(0) -> 1;
factorial(N) -> N * factorial(N-1).
If you had an already running universal server, such as you would by having called Pid = spawn(fun universal_server/0)
, you could then turn that universal server into the factorial server by calling Pid ! {become, fun factorial_server/0}.
Joe Armstrong had a way to get to the essence of a lot of concepts and to think about programming as a fun thing. Unfortunately for me, my experience with the software industry has left me more or less frustrated with the way things are, even if the way things are is for very good reasons. I really enjoyed programming Erlang professionally, but I eventually got sidetracked by other factors that would lead to solid, safe software—mostly higher level aspects of socio-technical systems, and I became an SRE.
But a part of me still really likes dreaming about the days where I could do hot code loading over entire clusters—see A Pipeline Made of Airbags—and I kept thinking about how I could bring this back, but within the context of complex software teams running CI/CD, building containers, and running them in Kubernetes. This is no small order, because we now have decades of lessons telling everyone that you want your infrastructure to be immutable and declared in code.
I also have a decade of experience telling me a lot of what we've built is a frustrating tower of abstractions over shaky ground. I know I've experienced better, my day job is no longer about slinging my own code, and I have no pretense of respecting the tower of abstraction itself.
A weed is a plant considered undesirable in a particular situation, "a plant in the wrong place". Examples commonly are plants unwanted in human-controlled settings, such as farm fields, gardens, lawns, and parks. Taxonomically, the term "weed" has no botanical significance, because a plant that is a weed in one context is not a weed when growing in a situation where it is wanted.
Like the weeds that decide that the tiniest crack in a driveway is actually real cool soil to grow in, I've decided to do the best thing given the situation and bring back Erlang live code upgrades to modern CI/CD, containerized and kubernetized infrastructure.
If you want the TL;DR: I wrote the dandelion project, which shows how to take an Erlang/OTP app, automate the generation of live code upgrade instructions with the help of pre-existing tools and CI/CD, generate a manifest file and store build artifacts, and write the necessary configuration to have Kubernetes run said containers and do automated live code upgrades despite its best attempts at providing immutable images. Then I pepper in some basic CI scaffolding to make live code upgrading a tiny bit less risky. This post describes how it all works.
A lot of "no downtime" deployments you'll find for Kubernetes are actually just rolling updates with graceful connection termination. Those are always worth supporting (even if it can be annoying to get right), but it has a narrow definition of downtime that's different from what we're aiming for here: no need to restart the application, dump and re-hydrate the state, nor to drop a single connection.
I wrote a really trivial application, nothing worth calling home about. It's a tiny TCP server with a bunch of acceptors where you can connect with netcat (nc localhost 8080
) and it just displays stuff. This is the bare minimum to show actual "no downtime": a single instance, changing its code definitions in a way that is directly observable to a client, without ever dropping a connection.
The application follows a standard release structure for Erlang, using Rebar3 as a build tool. Its supervision structure looks like this:
          top level supervisor
             /            \
     connection        acceptors
     supervisor        supervisor
         |                  |
     connection          acceptor
      workers              pool
    (1 per conn)
The acceptors supervisor starts a TCP listen socket, which is passed to each worker in the acceptor pool. Upon each accepted connection, a connection worker is started and handed the final socket. The connection worker then sits in an event loop. Every second it sends in a small ASCII drawing of a dandelion, and for every packet of data it receives (coalesced or otherwise), it sends in a line containing its version.
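To make that more concrete, here is a rough sketch of what such a connection worker's receive loop could look like. The module and function names here are made up for illustration and the socket is assumed to be in active mode; the real dandelion code may differ:

%% Hypothetical sketch of the per-connection event loop described above.
-module(dandelion_conn_sketch).
-export([loop/1]).

loop(Socket) ->
    receive
        {tcp, Socket, _Data} ->
            %% any received data gets answered with the running version
            gen_tcp:send(Socket, ["vsn: ", vsn(), "\n"]),
            loop(Socket);
        {tcp_closed, Socket} ->
            ok
    after 1000 ->
        %% every second, send a little ASCII dandelion
        gen_tcp:send(Socket, "  @\n\\ |\n__\\!/__\n"),
        loop(Socket)
    end.

vsn() ->
    %% version of the running dandelion application
    case application:get_key(dandelion, vsn) of
        {ok, Vsn} -> Vsn;
        undefined -> "unknown"
    end.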
A netcat session looks like this:
$ nc localhost 8080
  @
\ |
__\!/__
  @
\ |
__\!/__
> ping!
vsn: 0.1.5
  @
\ |
__\!/__
^C
Modern deployments are often done with containers and Kubernetes. I'm assuming you're familiar with both of these concepts, but if you want more information—in the context of Erlang—then Tristan Sloughter wrote a great article on Docker and Erlang and another one on using Kubernetes for production apps.
In this post, I'm interested in doing two things:
The trick here is deceptively simple, enough to think "that can't be a good idea." It probably isn't.
The plan? Use a regular container someone else maintains and just wedge my program's tarball in there. I can then use a sidecar to automate fetching updates and applying live code upgrades without Kubernetes knowing anything about it.
To understand how this can work, we first need to cover the basics of Erlang releases. A good overview of the Erlang virtual machine's structure is in an article I've written, OTP at a high level, but I can summarize it here by describing the following layers, from lowest to highest:
- the Erlang Run-Time System (ERTS), the virtual machine itself, which cannot be swapped out without a restart;
- kernel and stdlib, which define the most basic libraries around list processing, distributed programs, and the core of "OTP", the general development framework in the language. These can be live upgraded, but pretty much nobody does that;
- your own applications and their dependencies.

Your own project is pretty much just your own applications ("libraries"), bundled with a selection of standard library applications and a copy of the Erlang system:
The end result of this sort of ordeal is that every Erlang project is pretty much people writing their own libraries (in blue in the drawing above), fetching bits from the Erlang base install (in red in the drawing above), and then using tools (such as Rebar3) to repackage everything into a brand new Erlang distribution. A detailed explanation of how this happens is also in Adopting Erlang.
Whatever you built on one system can be deployed on an equivalent system. If you built your app on a 64 bit linux—and assuming you used static libraries for OpenSSL or LibreSSL, or have equivalent ones on a target system—then you can pack a tarball, unpack it on the other host system, and get going. Those requirements don't apply to your code, only the standard library. If you don't use NIFs or other C extensions, your own Erlang code, once built, is fully portable.
The cool thing is that Erlang supports making a sort of "partial" release, where you take the Erlang/OTP part (red) and your own apps (blue) in the above image, only package your own apps along with a sort of "figure out the Erlang/OTP part at run-time" instruction, and your application is going to be entirely portable across all platforms (Windows, Linux, BSD, MacOS) and supported architectures (x86, ARM32, ARM64, etc.)
I'm mentioning this because for the sake of this experiment, I'm running things locally on an M1 MacBook Air (ARM64), with MicroK8s (which runs a Linux/aarch64), but am using Github Actions, which are on an x86 Linux. So rather than using a base ubuntu image and then needing to run the same sort of hardware family everywhere down the build chain, I'll be using an Erlang image from dockerhub to provide the ERTS and stdlib, and will then have the ability to make portable builds from either my laptop or github actions and deploy them onto any Kubernetes cluster—something noticeably nicer than having to deal with cross-compilation in any language.
The release definition for Dandelion accounts for the above factors, and looks like this:
{relx, [{release, {dandelion, "0.1.5"},  % my release and its version (dandelion-0.1.5)
         [dandelion,                     % includes the 'dandelion' app and its deps
          sasl]},                        % and the 'sasl' library, which is needed for
                                         % live code upgrades to work

        %% set runtime configuration values based on the environment in this file
        {sys_config_src, "./config/sys.config.src"},
        %% set VM options in this file
        {vm_args, "./config/vm.args"},

        %% drop source files and the ERTS, but keep debug annotations
        %% which are useful for various tools, including automated
        %% live code upgrade plugins
        {include_src, false},
        {include_erts, false},
        {debug_info, keep},
        {dev_mode, false}]}.
The release can be built by calling rebar3 release and packaged by calling rebar3 tar.
Take the resulting tarball, unpack it, and you'll get a bunch of directories: lib/
contains the build artifact for all libraries, releases/
contains metadata about the current release version (and the structure to store future and past versions when doing live code upgrades), and finally the bin/
directory contains a bunch of accessory scripts to load and run the final code.
Call bin/dandelion
and a bunch of options show up:
$ bin/dandelion
Usage: dandelion [COMMAND] [ARGS]

Commands:
  foreground              Start release with output to stdout
  remote_console          Connect remote shell to running node
  rpc [Mod [Fun [Args]]]  Run apply(Mod, Fun, Args) on running node
  eval [Exprs]            Run expressions on running node
  stop                    Stop the running node
  restart                 Restart the applications but not the VM
  reboot                  Reboot the entire VM
  ...
  upgrade [Version]       Upgrade the running release to a new version
  downgrade [Version]     Downgrade the running release to a new version
  install [Version]       Install a release
  uninstall [Version]     Uninstall a release
  unpack [Version]        Unpack a release tarball
  versions                Print versions of the release available
  ...
So in short, your program's lifecycle can become:
- bin/dandelion foreground boots the app (use bin/dandelion console to boot a version in a REPL)
- bin/dandelion remote_console pops up a REPL onto the running app by using distributed Erlang

If you're doing the usual immutable infrastructure, that's it, you don't need much more. If you're doing live code upgrades, you then have a few extra steps:
- copy the new release tarball into the releases/ directory of the running system
- call bin/dandelion unpack <version> and the Erlang VM will unpack the new tarball into its regular structure
- call bin/dandelion install <version> to get the Erlang VM in your release to start tracking the new version (without switching to it)
- call bin/dandelion upgrade <version> to apply the live code upgrade

And from that point on, the new release version is live.
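Under the hood, those unpack/install/upgrade commands ultimately drive OTP's release_handler module. As a rough sketch (not the exact escript the boot script ships, and assuming a dandelion-0.1.6.tar.gz tarball was already copied into releases/), the equivalent calls from a remote console would be:

%% unpack the tarball, install the new version, then mark it permanent
{ok, Vsn} = release_handler:unpack_release("dandelion-0.1.6"),
{ok, _FromVsn, _Descr} = release_handler:install_release(Vsn),
ok = release_handler:make_permanent(Vsn),
release_handler:which_releases().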
I've sort of papered over the complexity required to "give it some instructions about how to do its live code upgrade." This area is generally really annoying and complex. You first start with appup files, which contain instructions on upgrading individual libraries, which are then packaged into a relup which provides instructions for coordinating the overall upgrade.
If you're running live code upgrades on a frequent basis you may want to get familiar with these, but most people never bothered, and the vast majority of live code upgrades are done by people writing manual scripts to load specific modules.
A very nice solution that also exists is to use Luis Rascão's rebar3_appup_plugin
which will take two releases, compare their code, and auto-generate instructions on your behalf. By using it, most of the annoyances and challenges are automatically covered for you.
All you need to do is to make sure all versions are adequately bumped, do a few command line invocations, and package it up. This will be a prime candidate for automation soon in this post.
For now though, let's assume we'll just put the release in an S3 bucket that the kubernetes cluster has access to, and build our infrastructure on the Kubernetes side.
Let's escape the Erlang complexity and don our DevOps hat. We now want to run the code we assume has made it safely to S3. All of it beautifully holds in a single YAML file—which, granted, can't really be beautiful on its own. I use three containers in a single kubernetes pod: an init container that bootstraps the release, a main container that runs it, and a sidecar that manages live code upgrades.
All of these containers will share a 'release' directory, by using an EmptyDir volume. The bootstrap container will fetch the latest release and unpack it there, the dandelion-release
container will run it, and the sidecar will be able to interact over the network to manage live code upgrades.
The bootstrap container runs first, and fetches the first (and current) release from S3. I'm doing so by assuming we'll have a manifest file (<my-s3-bucket>/dandelion-latest
) that contains a single version number that points to the tarball I want (<my-s3-bucket>/dandelion-<version>.tar.gz
). This can be done with a shell script:
#!/usr/bin/env bash
set -euxo pipefail

RELDIR=${1:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)

wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "/tmp/${RELEASE}-${TAG}.tar.gz"
tar -xvf "/tmp/${RELEASE}-${TAG}.tar.gz" -C ${RELDIR}
rm "/tmp/${RELEASE}-${TAG}.tar.gz"
This fetches the manifest, grabs the tag, fetches the release, unpacks it, and deletes the old tarball. The dandelion-release
container, which will run our main app, can then just call the bin/dandelion
script directly:
#!/usr/bin/env bash
set -euxo pipefail

RELDIR=${1:-/release}
exec ${RELDIR}/bin/${RELEASE} foreground
The sidecar is a bit more tricky, but can reuse the same mechanisms. Every time interval (or based on a feature flag or some server-sent signal), check the manifest, and apply the unpacking steps. Something a bit like:
#!/usr/bin/env bash
set -euxo pipefail

RELDIR=${2:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
    ${RELDIR}/bin/${RELEASE} unpack ${TAG}
    ${RELDIR}/bin/${RELEASE} install ${TAG}
    ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
fi
Call this in a loop and you're good to go.
Now here's the fun bit: ConfigMaps are a Kubernetes thing that lets you take arbitrary metadata and, optionally, mount it as files inside pods. This is how we get close to our universal container.
By declaring the three scripts above as a ConfigMap and mounting them in a /scripts
directory, we can then declare the 3 containers in a generic fashion:
initContainers:
  - name: dandelion-bootstrap
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/init-latest.sh
# Regular containers run next
containers:
  - name: dandelion-release
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/boot-release.sh
    ports:
      - containerPort: 8080
        hostPort: 8080
  - name: dandelion-sidecar
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/update-loop.sh
The full file has more details, but this is essentially all we need. You could kubectl apply -f dandelion.yaml
and it would get going on its own. The rest is about providing a better developer experience.
What we have defined now is an expected format and procedure from Erlang's side to generate code and live upgrade instructions, and a wedge to make this usable within Kubernetes' own structure. This procedure is somewhat messy, and there are a lot of technical aspects that need to be coordinated to make this usable.
Now comes the time to work on providing a useful workflow for this.
Semver's alright. Most of the time I won't really care about it, though. I'll go read the changelog and see if whatever I depend on has changed or not. People will pick versions for whichever factor they want, and they'll very often put a small breaking change (maybe a bug fix!) as non-breaking because there's an intent being communicated by the version.
Here the semver semantics are not useful. I've just defined a workflow that mostly depends on whether the server can be upgraded live or not, with some minor variations. This operational concern is likely to be the main concern of engineers who would work on such an application daily, particularly since as someone deploying and maintaining server-side software, I mostly own the whole pipeline and always consider the main branch to be canonical.
As such, I should feel free to develop my own versioning scheme. Since I'm trying to reorient Dandelion's whole flow towards continuous live delivery, my versioning scheme should actually reflect and support that effort. I therefore introduce Smoothver (Smooth Versioning):
Given a version number RESTART.RELUP.RELOAD, increment the:
- RESTART version when you make a change that cannot be applied live and requires restarting the VM,
- RELUP version when you make a change that requires pausing processes and migrating state with code change instructions during a live upgrade,
- RELOAD version when you make a change that can be applied by simply reloading modules.
The version number now communicates the relative expected risk of a deployment in terms of disruptiveness, carries some meaning around the magnitude of change taking place, and can be leveraged by tooling.
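To show how tooling could lean on the scheme, here is a small, hypothetical helper (mine, not something dandelion ships): a new version may be attempted as a live upgrade only when the RESTART part is unchanged and the rest moves forward.

%% live_upgradable("0.1.5", "0.2.0") -> true
%% live_upgradable("0.1.5", "1.0.0") -> false
live_upgradable(CurrentVsn, NewVsn) ->
    [R1, U1, D1] = [list_to_integer(S) || S <- string:split(CurrentVsn, ".", all)],
    [R2, U2, D2] = [list_to_integer(S) || S <- string:split(NewVsn, ".", all)],
    R1 =:= R2 andalso {U2, D2} > {U1, D1}.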
For example, the bump can hinge on something as small as having used a > for a >= in a comparison. As with anything we do, the version bump may be wrong. But it at least carries a certain safety level in letting you know that a RESTART live code upgrade should absolutely not be attempted.
Engineers who get more familiar with live code upgrades will also learn some interesting lessons. For example, a RELUP change over a process that has tens of thousands of copies of itself may take a long long time to run and be worse than a rolling upgrade. An interesting thing you can do then is turn RELUP
changes (which would require calling code change instructions) into basic code reloads by pattern matching an old structure and converting it on each call, turning it into a somewhat stateless roll-forward affair.
That's essentially converting operational burdens into dirtier code, but this sort of thing is something you do all the time with database migrations (create a new table, double-write, write only to the new one, delete the old table) and that can now be done with running code.
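As an illustration of that trick (a hypothetical module, not code from dandelion), every callback can start by normalizing whatever state shape it might still be holding, so the change ships as a plain RELOAD instead of a RELUP with code change instructions:

%% Sketch: tolerate the old state shape on every call instead of migrating it
%% once through code_change/3.
-module(roll_forward_sketch).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2]).

%% old state: {Count}  -->  new state: #{count := N}
upgrade_state({Count}) -> #{count => Count};
upgrade_state(State) when is_map(State) -> State.

init([]) -> {ok, #{count => 0}}.

handle_call(read, _From, State0) ->
    State = upgrade_state(State0),
    {reply, maps:get(count, State), State}.

handle_cast(bump, State0) ->
    State = upgrade_state(State0),
    {noreply, State#{count := maps:get(count, State) + 1}}.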
For a new development workflow that tries to orient itself towards live code upgrades, Smoothver is likely to carry a lot more useful information than Semver would (and maybe could be nice for database migrations as well, since they share concerns).
I needed to introduce the versioning mechanism because the overall publication workflow will obey it. If you're generating a new release version that requires a RESTART bump, then don't bother generating live code upgrade instructions. If you're generating anything else, do include them.
I've decided to center my workflow around git tags. If you tag your release v1.2.3
, then v1.2.4
or v1.4.1
all do a live code upgrade, but v2.0.0
won't, regardless of which branch they go to. The CI script is not too complicated, and is in three parts:
That's really all there is to it. I'm assuming that if you wanted to have more environments, you could setup gitops by having more tags (staging-v1.2.3
, prod-v1.2.5
) and more S3 buckets or paths. But everything is assumed to be driven by these builds artifacts.
A small caveat here is that it's technically possible to generate upgrade instructions (appup files) that map from many to many versions: how to update to 1.2.0 from 1.0.0, 1.0.1, 1.0.2, and so on. Since I'm assuming a linear deployment flow here, I'm just ignoring that and always generating pairs from "whatever is in prod" to "whatever has been tagged". There are obvious race conditions in doing this, where two releases generated in parallel can specify upgrade rules from a shared release, but could be applied and rolled out in a distinct order.
Relying on the manifest and versions requires a few extra lines in the sidecar's update loop. They look at the version, and if it's a RESTART bump or an older release, they ignore it:
# Get the running version
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    IS_UPGRADE=$(echo "$TAG $CURRENT" | awk -v FS='[. ]' '($1==$4 && $2>$5) || ($1==$4 && $2>=$5 && $3>$6) {print 1; exit} {print 0}')
    if [[ ${IS_UPGRADE} -eq 1 ]]; then
        wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
        ${RELDIR}/bin/${RELEASE} unpack ${TAG}
        ${RELDIR}/bin/${RELEASE} install ${TAG}
        ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
    fi
fi
There's some ugly awk logic in there, but I wanted to avoid having to host my own images. The script could be made a lot more solid by looking at whether we're bumping from the proper version to the next one, and in this it shares a sort of similar race condition to the generation step.
On the other hand, the install
step looks at the specified upgrade instructions and will refuse to apply itself (resulting in a sidecar crash) if a bad release is applied.
I figure that alerting on crashed sidecars could be used to drive further automation to ask to delete and replace the pods, resulting in a rolling upgrade. Alternatively, the error itself could be used to trigger a failure in liveness and/or readiness probes, and force-automate that replacement. This is left as an exercise to the reader, I guess. The beauty of writing prototypes is that you can just decide this to be out of scope and move on, and let someone who's paid to operationalize that stuff to figure out the rest.
Oh and if you just change the Erlang VM's version? That changes the kubernetes YAML file, and if you're using anything like helm or some CD system (like ArgoCD), these will take care of running the rolling upgrade for you. Similarly, annotating the chart with a label of some sort indicating the RESTART version will accomplish the same purpose.
You may rightfully ask whether it is a good idea to bring mutability of this sort to a containerized world. I think that using S3 artifacts isn't inherently less safe than a container registry, dynamic feature flags, or relying on encryption services or DNS records for functional application state. I'll leave it at that.
Versioning things is really annoying. Each OTP app and library, and each release, needs to be versioned properly. And sometimes you change dependencies, those dependencies won't have relup instructions available, you don't find out, and that breaks your live code upgrade.
What we can do is add a touch of automation to catch the most obvious failure situations and warn developers early about these issues. I've done so by adding a quick relup CI step to all pull requests, by using a version check script that encodes most of that logic.
The other thing I started experimenting with was setting up some sort of test suite for live code upgrades:
# This here step is a working sample, but if you were to run a more
# complex app with external dependencies, you'd also have to do a
# more intricate multi-service setup here, e.g.:
# https://github.com/actions/example-services
- name: Run relup application
  working-directory: erlang
  run: |
    mkdir relupci
    tar -xvf "${{ env.OLD_TAR }}" -C relupci
    # use a simple "run the task in the background" setup
    relupci/bin/dandelion daemon
    TAG=$(echo "${{ env.NEW_TAR }}" | sed -nr 's/^.*([0-9]+\.[0-9]+\.[0-9]+)\.tar\.gz$/\1/p')
    cp "${{ env.NEW_TAR }}" relupci/releases/
    relupci/bin/dandelion unpack ${TAG}
    relupci/bin/dandelion install ${TAG}
    relupci/bin/dandelion upgrade ${TAG}
    relupci/bin/dandelion versions
The one thing that would make this one a lot cooler is to write a small extra app or release that runs in the background while the upgrade procedure goes on. It could do things like holding some state and keeping a connection open for the whole duration of the upgrade.
By starting that process before the live upgrade and questioning it after, we could ensure that the whole process went smoothly. Additional steps could also look at logs to know if things were fine.
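Here is a rough sketch of what such a background check could look like (entirely hypothetical; dandelion's CI does not ship this): a process started before the upgrade that keeps a connection open and a bit of state, and that can be questioned afterwards.

-module(upgrade_witness).
-export([start/0, check/1]).

start() ->
    spawn(fun() ->
        {ok, Sock} = gen_tcp:connect("localhost", 8080, [binary, {active, false}]),
        loop(Sock, erlang:system_time(millisecond))
    end).

loop(Sock, StartedAt) ->
    receive
        {check, From} ->
            %% crude liveness check: the socket should still have a peer
            Alive = case inet:peername(Sock) of
                        {ok, _} -> true;
                        _ -> false
                    end,
            From ! {witness, Alive, StartedAt},
            loop(Sock, StartedAt)
    end.

check(Pid) ->
    Pid ! {check, self()},
    receive
        {witness, Alive, StartedAt} -> {Alive, StartedAt}
    after 5000 -> timeout
    end.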
The advantage of adding CI here is that each pull request can take measures to ensure it is safely upgradable live before being merged to main, even if none of them are deployed right away. By setting that gate in place, engineers are getting a much shorter feedback loop asking them to think about live deployments.
I've run through a few iterations to test and check everything. I've set up microk8s on my laptop, ran kubectl apply -f dandelion.yaml
and showed that the pod was up and running fine:
$ kubectl -n dandelion get pods
NAME                                    READY   STATUS    RESTARTS   AGE
dandelion-deployment-648db88f44-49jl8   2/2     Running   0          25h
It is possible to get a shell into one of the containers, log onto a REPL, and see what is going on:
$ kubectl -n dandelion exec -i -t dandelion-deployment-648db88f44-49jl8 -c dandelion-sidecar -- /bin/bash
root@dandelion-deployment-648db88f44-49jl8:/# /release/bin/dandelion remote_console
Erlang/OTP 25 [erts-13.0.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit]

Eshell V13.0.2  (abort with ^G)
(dandelion@localhost)1> release_handler:which_releases().
[{"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  permanent},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]
This shows that the container had been running for a day, and already had two releases—it first booted on version 0.1.4 and had already gone through a bump to 0.1.5. I ran a small pull request changing the display (and messed up versioning, which CI caught!), merged it, tagged it v0.1.6
, and started listening to my Kubernetes cluster:
$ nc 192.168.64.2 8080
  @
\ |
__\!/__
...
  @
\ |
__\!/__
vsn?
vsn: 0.1.5
  @
\ |
__\!/__
*
  @
\ |
__\!/__
vsn?
vsn: 0.1.6
*
  @
\ |
__\!/__
...
This shows me interrogating the app (vsn?
) and getting the version back, and without dropping the connection, having a little pappus floating in the air.
My REPL session was still live in another terminal:
(dandelion@localhost)2> release_handler:which_releases().
[{"dandelion","0.1.6",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.6","sasl-4.2"],
  permanent},
 {"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  old},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]
showing that the old releases are still around as well. And here we have it, an actual zero-downtime deploy in a kubernetes container.
Joe's favorite program could hold on a business card. Mine is maddening. But I think this is because Joe didn't care for the toolchains people were building and just wanted to do his thing. My version reflects the infrastructure we have put in place, and the processes we want and need for a team.
Rather than judging the scaffolding, I'd invite you to think about what would change when you start centering your workflow around a living system.
Those of you who have worked with bigger applications that have a central database or shared schemas around network protocols (or protobuf files or whatever) know that you approach your work differently when you have to consider how it's going to be rolled out. It impacts your tooling, how you review changes, how you write them, and ultimately just changes how you reason about your code and changes.
In many ways it's a more cumbersome way to deploy and develop code, but you can also think of the other things that change: what if instead of having configuration management systems, you could hard-code your config in constants that just get rolled out live in less than a minute—and all your configs were tested as well as your code? Since all the release upgrades implicitly contain a release downgrade instruction set, just how much faster could you rollback (or automate rolling back) a bad deployment? Would you be less afraid of changing network-level schema definitions if you made a habit of changing them within your app? How would your workflow change if deploying took half-a-second and caused absolutely no churn nor disruption to your cluster resources most of the time?
Whatever structure we have in place guides a lot of invisible emergent behaviour, both in code and in how we adjust ourselves to the structure. Much of what we do is a tacit response to our environment. There's a lot of power in experimenting with alternative structures, and seeing what pops up at the other end. A weed is only considered as such in some contexts. This is a freak show of a deployment mechanism, but it sort of works, and maybe it's time to appreciate the dandelions for what they can offer.
Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities.
For details about new features, bugfixes and potential incompatibilities see the Erlang 25.0 README or the Erlang/OTP 25.0 downloads page.
Many thanks to all contributors!
- filelib:ensure_path/1 will ensure that all directories for the given path exist
- groups_from_list/2 and groups_from_list/3 in the maps module
- uniq/1 and uniq/2 in the lists module
- New functions in the rand module, for fast pseudo-random numbers.
- Support for selectable features as described in EEP-60. Features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used to erl for enabling/disabling features allowed at runtime. The new maybe expression EEP-49 is fully supported as the feature maybe_expr.
- The JIT now has better support for perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.
- Users can now configure ETS tables with the {write_concurrency, auto} option. This option forces tables to automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active.
- Signal reception for processes using message_queue_data=off_heap has been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.
- A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
- New quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.
- The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
- global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions. It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.
- The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.
- The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.
- The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used. To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate and then it will also be possible to use from inside the module being compiled.
- When a record matching or record update fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.
- Add compile attribute -nifs() to empower compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.
- Improved and more detailed error messages, both in the shell and via erl_error:format_exception/3,4.
- Change format of feature options and directives for better consistency. Options to erlc and the -compile(..) directive now has the format {feature, feature-name, enable | disable}. The -feature(..) now has the format -feature(feature-name, enable | disable).
- Add crypto:hash_equals/2, which is a constant-time comparison of hash values.
- Introducing a new (still experimental) option {certs_keys,[cert_key_conf()]}. With this, a list of certificates with their associated keys may be used to authenticate the client or the server. The certificate key pair that is considered best and matches negotiated parameters for the connection will be selected.
- Optimized operations in the erl_types module and parallelized the Dialyzer pass remote.
- Added the missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as overspecs and underspecs.
- Dialyzer now better understands the types for min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.
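To give a feel for the new maybe expression mentioned above, here is a small, hypothetical example (it requires the maybe_expr feature to be enabled; parse_int/1 is an assumed helper):

%% Each ?= match must succeed for evaluation to continue; any non-matching
%% value short-circuits into the else clauses.
parse_pair(Bin) ->
    maybe
        {ok, A, Rest} ?= parse_int(Bin),
        {ok, B, _} ?= parse_int(Rest),
        {ok, {A, B}}
    else
        {error, Reason} -> {error, Reason}
    end.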
Download links for this and previous versions are found here
Over a year ago, I left the role of software engineering* behind to become a site reliability engineer* at Honeycomb.io. Since then, I've been writing a bunch of blog posts over there rather than over here, including the following:
There are also a couple of incident reviews, including one on a Kafka migration and another on a spate of scaling-related incidents.
Either way, I only have so many things to rant about to fill two blogs, so this place here has been a bit calmer. However, I recently gave a talk at IRConf (video).
I am reproducing this talk here because, well, I stand behind the content, but it would also not fit the work blog's format. I'm also taking this opportunity because I don't know how many talks I'll give in the next few years. I've decided to limit how much traveling I do for conferences due to environmental concerns—if you see me at a conference, it either overlapped with other opportunistic trips either for work or vacations, or they were close enough for me to attend them via less polluting means—and so I'd like to still post some of the interesting talks I have when I can.
This talk is first of all about the idea that errors are not objective truths. Even if we look at objective facts with a lot of care, errors are arbitrary interpretations that we come up with, constructions that depend on the perspective we have. Think of them the same way constellations in the night sky are made up of real stars, but their shape and meaning are made up based on our point of view and what their shapes remind us of.
The other thing this talk will be about is ideas about what we can do once we accept this idea, to figure out the sort of changes and benefits we can get from our post-incident process when we adjust to it.
I tend to enjoy incidents a lot. Most of the things in this talk aren't original ideas, they're things I've read and learned from smarter, more experienced people, and that I've put back together after digesting them for a long time. In fact, I thought my title for this talk was clever, but as I found out by accident a few days ago, it's an almost pure paraphrasing of a quote in a book I've read over 3 years ago. So I can't properly give attribution for all these ideas because I don't know where they're from anymore, and I'm sorry about that.
This is a quote from "'Those found responsible have been sacked': some observations on the usefulness of error" that I'm using because even if errors are arbitrary constructions, they carry meaning, and they are useful to organizations. The paper defines four types I'll be paraphrasing:
So generally, error is useful as a concept, but as an investigator it is particularly useful as a signal to tell you when things get interesting, not as an explanation on their own.
And so this sort of makes me think about how a lot of incident reviews tend to go. We use the incident as an opportunity because the disruption is big enough to let us think about it all. But the natural framing that easily comes through is to lay the blame on the operational area.
Here I don't mean blame as in "people fucked up" nearly as much as "where do we think the organisation needs to improve"—where do we think that as a group we need to improve as a result of this. The incident and the operations are the surface; they often need improvement for sure, because it is really tricky work done in special circumstances and it's worth constantly adjusting it, but stopping there is missing out on a lot of possible content that could be useful.
People doing the operations are more or less thrown in a context where a lot of big decisions have been made already. Whatever was tested, who was hired, what the budgets are, and all these sorts of pressures are in a large part defined by the rest of the organization, they matter as well. They set the context in which operations take place.
So one question then, is how do we go from that surface-level vision, and start figuring out what happens below that organisational waterline.
The first step of almost any incident investigation is to start with a timeline. Something that lets us go back from the incident or its resolution, and that we use as breadcrumbs to lead us towards ways to prevent this sort of thing from happening again. So we start at the bottom where things go boom, and walk backwards from there.
The usual frameworks we're familiar with apply labels to these common patterns. We'll call something a failure or an error. The thing that happened right before it tends to be called a proximate cause, which is frequently used in insurance situations: it's the last event in the whole chain that could have prevented the failure from happening. Then we walk back. Either five times because we're doing the five whys, or until you land at a convenient place. If there is a mechanical or software component you don't like, you're likely to highlight its flaws there. If it's people or teams you don't trust as much, you may find human error there.
Even the concept of steady state is a shaky one. Large systems are always in some weird degraded state. In short, you find what you're looking for. The labels we use, the lens through which we look at the incident, influence the way we build our explanations.
The overall system is not any of the specific lenses, though, it's a whole set of interactions. To get a fuller, richer picture, we have to account for what things looked like at the time, not just our hindsight-fuelled vision when looking back. There are a lot of things happening concurrently, a lot of decisions made to avoid bad situations that never took place, and some that did.
Hindsight bias is something somewhat similar to outcome bias, which essentially says that because we know there was a failure, every decision we look at that has taken place before the incident will seem to us like it should obviously have appeared as risky and wrong. That's because we know the result, it affects our judgment. But when people were going down that path and deciding what to do, they were trying to do a good job; they were making the best calls they could to the extent of their abilities and the information available at the time.
We can't really avoid hindsight bias, but we can be aware of it. One tip there is to look at what was available at the time, and consider the signals that were available to people. If they made a decision that looks weird, then look for what made it look better than the alternatives back then.
Counterfactuals are another thing to avoid, and they're one of the trickiest ones to eliminate from incident reviews. They essentially are suppositions about things that have never happened and will never happen. Whenever we say "oh, if we had done this at this point in time instead of that, then the incident would have been prevented." They're talking about a fictional universe that never happened, and they're not productive.
I find it useful to always cast these comments into the future: "next time this happens, we should try that to prevent an issue." This orients the discussion towards more realistic means: how can we make this option more likely? the bad one less interesting? In many cases, a suggestion will even become useless: by changing something else in the system, a given scenario may no longer be a concern for the future, or it may highlight how a possible fix would in fact create more confusion.
Finally, normative judgments. Those are often close to counterfactuals, but you can spot them because they tend to be about what people should or shouldn't have done, often around procedures or questions of competence. "The engineer should have checked the output more carefully, and they shouldn't have run the command without checking with someone else, as stated in the procedure." Well they did because it arguably looked reasonable at the time!
The risk with a counterfactual judgment is that it assumes that the established procedure is correct and realistically applicable to the situation at hand. It assumes that deviations and adjustments made by the responder are bound to fail even if we'll conveniently ignore all the times they work. We can't properly correct procedures if we think they're already perfect and it's wrong not to obey them, and we can't improve tooling if we believe the problem is always the person holding it wrong.
A key factor is to understand that in high pressure incident responses, failure and successes use the same mechanisms. We're often tired, distracted, or possibly thrown in there without adequate preparation. What we do to try and make things go right and often succeed through is also in play when things go wrong.
People look for signals, and have a background and training that influences the tough calls that usually will be shared across situations. We tend to want things to go right. The outcome tends to define whether the decision was a good one or not, but the decision-making mechanism is shared both for decisions that go well and those that do not. And so we need to look at how these decisions are made with the best of intentions to have any chance of improving how events unfold the next time.
This leads to the idea that you want to look at what's not visible, because that's where the real work shows.
I say this is "real work" because we come in to a task with an understanding of things, a sort of mental model. That mental model is the rolled up experience we have, and lets us frame all the events we encounter, and is the thing we use to predict the consequences of our decisions.
When we are in an incident, there's almost always a surprise in there, which means that the world and our mental model are clashing. This mismatch between our understanding of the world and the real world was already there. That gap between both needs to be closed, and the big holes in an incident's timelines tend to be one of the places where this takes place.
Whenever someone reports "nothing relevant happens here", these are generally the places where active hypothesis generation periods happen, where a lot of the repair and gap bridging is taking place.
This is where the incident can become a very interesting window into the whole organizational iceberg below the water line.
So looking back at the iceberg, looking at how decisions are made in the moment lets you glimpse at the values below the waterline that are in play. What are people looking at, how are they making their decisions. What's their perspective. These are impacted by everything else that happens before.
If you see concurrent outages or multiple systems impacted, digging into which one gets resolved first and why that is is likely to give you insights about what responders feel confident about, the things they believe are more important to the organization and users. They can reflect values and priorities.
If you look at who they ask help from and where they look for information (or avoid looking for it), this will let you know about various dynamics, social and otherwise, that might be going on in your organization. This can be because some people are central points of knowledge, others are jerks, seen as more or less competent, or also be about what people believe the state of documentation is at that point in time.
And this is why changing how we look at and construct errors matters. If we take the straightforward causal approach, we'll tend to only skim the surface. Looking at how people do their jobs and how they make decisions is an effective way to go below that waterline, and have a much broader impact than staying above water.
To take a proper dive, it helps to ask the proper type of questions. As a facilitator, your job is to listen to what people tell you, but there are ways to prompt for more useful information. The Etsy debriefing facilitation guide is a great source, and so is Jeli's Howie guide. The slide contains some of the questions I like to ask most.
There's one story I recall from a previous job where a team had specifically written an incident report on an outage with some service X; the report had that sort of 30-minute gap in it, and they were asking for feedback on it. I instantly asked "so what was going on during this time?" Only for someone on the team to answer "oh, we were looking for the dashboard of service Y". I asked why they had been looking at the dashboard of another service, and he said that the service's own dashboard isn't trustworthy, and that this one gave a better picture of the health of the service through its effects. And just like that we opened new paths for improvements around things so normal they had become invisible.
Another one also came from a previous job where an engineer kept accidentally deleting production databases and triggering a whole disaster recovery response. They were initially trying to delete a staging database that was dynamically generated for test cases, but kept fat-fingering the removal of production instances in the AWS console. Other engineers were getting mad and felt that person was being incompetent, and were planning to remove all of their AWS console permissions because there also existed an admin tool that did the same thing safely by segmenting environments.
I ended up asking the engineer if there was anything that made them choose the AWS console over the admin tool given the difference in safety, and they said, quite simply, that the AWS console has an autocomplete and they never remembered the exact table name, so it was often just much faster to delete that table there than in the admin tool. This was an interesting one because instead of blaming the engineer for being incompetent, it opened the door to questioning the gap in tooling rather than adding more blockers and procedures.
In both of these stories, a focus on how people were making their decisions and their direct work experience managed to highlight alternative views that wouldn't have come up otherwise. They can generate new, different insights and action items.
And this is the sort of map that, when I have time for it, I tried to generate at Honeycomb. It's non-linear, and the main objective is to help show different patterns about the response. Rather than building a map by categorizing events within a structure, the idea is to lay the events around to see what sort of structure pops up. And then we can walk through the timeline and ask what we were thinking, feeling, or seeing.
The objective is to highlight challenging bits and look at the way we work in a new light. Are there things we trust, distrust? Procedures that don't work well, bits where we feel lost? Focusing on these can improve response in the future.
This idea of focusing on generating rather than categorizing is intended to take an approach that is closer to Qualitative science than Quantitative research.
The way we structure our reviews will have a large impact on how we construct errors. I tend to favour a qualitative approach to a quantitative one.
A quantitative approach will often look at ways to aggregate data, and create ways to compare one incident to the next. They'll measure things such as the Mean Time To Recovery (MTTR), the impact, the severity, and will look to assign costs and various classifications. This approach will be good to highlight trends and patterns across events, but as far as I can tell, they won't necessarily provide a solid path for practical improvements for any of the issues found.
The qualitative approach by comparison aims to do a deep dive to provide more complete understanding. It tends to be more observational and generative. Instead of cutting up the incidents and classifying its various parts, we look at what was challenging, what are the things people felt were significant during the incident, and all sorts of messy details. These will highlight tricky dynamics, both for high-pace outages and wider organizational practices, and are generally behind the insights that help change things effectively.
To put this difference in context, I have an example from a prior job, where one of my first mandates was to try and help with their reliability story. We went over 30 or so incident reports that had been written over the last year, and a pattern that quickly came up was how many reports mentioned "lack of tests" (or lack of good tests) as causes, and had "adding tests" in action items.
By looking at the overall list, our initial diagnosis was that testing practices were challenging. We thought of improving the ergonomics around tests (making them faster) and to also provide training in better ways to test. But then we had another incident where the review reported tests as an issue, so I decided to jump in.
I reached out to the engineers in question and asked about what made them feel like they had enough tests. I said that we often write tests up until the point we feel they're not adding much anymore, and that I was wondering what they were looking at, what made them feel like they had reached the points where they had enough tests. They just told me directly that they knew they didn't have enough tests. In fact, they knew that the code was buggy. But they felt in general that it was safer to be on-time with a broken project than late with a working one. They were afraid that being late would put them in trouble and have someone yell at them for not doing a good job.
And so that revealed a much larger pattern within the organization and its culture. When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn't ready. The engineers on that team felt that while this is what they were being told, in practice they'd still get in trouble.
There's no amount of test training that would fix this sort of issue. The engineers knew they didn't have enough tests and they were making that tradeoff willingly.
So to conclude on this, the focus should be on understanding the mess:
Overall, the idea is that looking for understanding more than causes opens up a lot of doors and makes incidents more valuable.
* I can't legally call myself a software engineer, and technically neither can I be a site reliability engineer, because Quebec considers engineering to be a protected discipline. I however, do not really get to tell American employers what they should give as a job title to people, so I get stuck having titles I can't legally advertise but for which no real non-protected forms exist to communicate. So anywhere you see me referred to any sort of "engineer", that's not an official thing I would choose as a title. It'd be nice to know what the non-engineer-titled equivalent of SRE ought to be.
Erlang/OTP 25-rc3 is the third and final release candidate before the OTP 25.0 release.
The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues or by posting to Erlangforums or the mailing list erlang-questions@erlang.org.
All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-13.0-rc3/doc/.
You can also install the latest release using kerl like this: kerl build 25.0-rc3 25.0-rc3.
Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.
Many thanks to all contributors!
Below are some highlights of the release:
The format of feature options and directives has been changed for better consistency. Options to erlc and the -compile(..) directive now have the format {feature, feature-name, enable | disable}. The -feature(..) directive now has the format -feature(feature-name, enable | disable).
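As a rough sketch of what the new forms look like in practice (the module name demo and the choice of the maybe_expr feature are just examples):

%% demo.erl - enabling a feature with the new directive format
-module(demo).
-feature(maybe_expr, enable).

%% From the command line, the same thing uses the +term option format
%% described above (sketch of an assumed invocation):
%%   erlc +'{feature,maybe_expr,enable}' demo.erl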
A new (still experimental) option {certs_keys, [cert_key_conf()]} is introduced. With this, a list of certificates with their associated keys may be used to authenticate the client or the server. The certificate-key pair that is considered best and matches the negotiated parameters for the connection will be selected.
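Assuming this is the ssl application's new option and that a cert_key_conf() map takes certfile/keyfile entries (both are assumptions here; check the ssl documentation for the exact shape), a server-side sketch could look like this:

%% Sketch: offer both an RSA and an ECDSA certificate on one TLS listener.
%% File names are placeholders.
{ok, ListenSocket} =
    ssl:listen(8443,
               [{certs_keys, [#{certfile => "rsa_cert.pem",
                                keyfile  => "rsa_key.pem"},
                              #{certfile => "ecdsa_cert.pem",
                                keyfile  => "ecdsa_key.pem"}]},
                {reuseaddr, true}]).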
New function filelib:ensure_path/1 will ensure that all directories for the given path exist.

New functions groups_from_list/2 and groups_from_list/3 in the maps module.

New functions uniq/1 and uniq/2 in the lists module.

Features as described in EEP-60: features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used with erl for enabling/disabling features allowed at runtime. The new maybe expression (EEP-49) is fully supported as the feature maybe_expr.

Improved support for perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

Users can now configure ETS tables with the {write_concurrency, auto} option. This option makes tables automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active. Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks.
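For example, opting a table into the adaptive locking behavior is just a matter of passing the new option when the table is created (the table name and contents here are made up):

%% Create a table whose number of locks adapts to the observed concurrency.
%% decentralized_counters defaults to true when write_concurrency is auto.
Tab = ets:new(hit_counters, [set, public, {write_concurrency, auto}]),
true = ets:insert(Tab, {hits, 0}).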
Processes configured with message_queue_data=off_heap have been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.

A new option short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.

New quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.

The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
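A small sketch of what using peer instead of slave can look like, assuming the map-based peer:start_link/1 options and the peer:random_name/0 helper (see the peer documentation for the exact API):

%% Start a linked peer node, run something on it, then shut it down.
%% The calling node must itself be distributed for this particular variant.
{ok, Peer, Node} = peer:start_link(#{name => peer:random_name()}),
Node = erpc:call(Node, erlang, node, []),
ok = peer:stop(Peer).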
global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.

It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.
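If the old behavior is needed, the parameter can be set in, for example, a sys.config file; a minimal sketch:

%% sys.config - opt back into the OTP 24 (and earlier) global semantics.
[
 {kernel, [{prevent_overlapping_partitions, false}]}
].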
The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback. The new callback adds the possibility to limit and change many more things than just the state.

The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.

The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used. To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate, and then it will also be possible to enable the feature from inside the module being compiled.
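To make that concrete, here is a small sketch of a maybe expression (the function and map keys are made up; the module must be compiled with the maybe_expr feature enabled as described above):

%% Succeeds only if both keys are present; otherwise falls through to else.
fetch_sum(Map) ->
    maybe
        {ok, A} ?= maps:find(a, Map),
        {ok, B} ?= maps:find(b, Map),
        {ok, A + B}
    else
        error -> {error, missing_key}
    end.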
When matching or updating a record fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.

A new -nifs() attribute to empower the compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.

Improvements to erl_error:format_exception/3,4.

New function crypto:hash_equals/2, which is a constant-time comparison of hash values.

Dialyzer optimizations: operations in the erl_types module have been optimized, and the Dialyzer pass remote has been parallelized.

New missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as, overspecs and underspecs.

Improved type analysis of min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.

For more details about new features and potential incompatibilities, see the release notes.
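Several of the smaller additions above can be tried straight from an erl shell. A quick sketch (the results in the comments are what I would expect, not output captured from the release candidate):

%% Shortest round-trippable float representation:
erlang:float_to_list(0.1, [short]).          %% "0.1"

%% New list and map helpers:
lists:uniq([3, 1, 3, 2, 1]).                 %% [3, 1, 2]
maps:groups_from_list(fun(X) -> X rem 2 end, [1, 2, 3, 4]).
                                             %% #{0 => [2, 4], 1 => [1, 3]}

%% Percent-encoding with uri_string:
uri_string:quote("foo bar").                 %% "foo%20bar"

%% Constant-time comparison of two equal-length hashes:
crypto:hash_equals(crypto:hash(sha256, <<"x">>),
                   crypto:hash(sha256, <<"x">>)).  %% true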
Variable syntax is one of the big differences between Erlang and Elixir that I encountered when learning Elixir. Instead of having to start each variable name with an uppercase letter, a lowercase letter must be used. This change in syntax seems like an improvement - after all, most mainstream programming languages require variables to start with a lowercase letter, and lowercase is generally easier to type. However, a deeper look at this syntax choice reveals some significant downsides that I want to present here.
The example I’m using to illustrate this problem is from a blog post on variable shadowing in Elixir by Michael Stalker.
defmodule Shadowing do
  x = 5
  def x, do: x
  def x(x = 0), do: x
  def x(x), do: x(x - 1)
end
Without running the code, tell me what the return values of these three function calls are:
Shadowing.x()
Shadowing.x(0)
Shadowing.x(2)
No, really. Think about the code for a minute.
Now…are you positive your answers are right?
This code snippet is confusing because the variable names and function names are indistinguishable from each other. This is an ambiguity in scope and also an ambiguity in identifier type. It’s not clear whether the token x
is the function name (an atom) or a variable (identified by the same sequence of characters). Both are identifiers, but unlike in Erlang, function identifiers and variable identifiers look the same. Despite this, the compiler doesn't get confused and handles this code according to Elixir's scoping rules.
I translated the Elixir code above to Erlang. The functions in this Erlang module behave exactly the same as the functions in the Elixir module above.
-module(shadowing).
-export([x/0, x/1]).
-define(X, 5).
x() -> x().
x(X) when X == 0 -> X;
x(X) -> x(X - 1).
With Erlang all the ambiguity is gone. We now have functions and variables that cannot be confused. All variables start with an uppercase letter, and all function names start with a lowercase letter or are wrapped in single quotes. This makes it impossible to confuse the two. Granted, this is not an apples-to-apples comparison, because Erlang doesn't have a module scope for variables, so I used a macro for the module-level variable. But we still have a function and a variable that can no longer be confused.
Despite it’s rough edges Erlang syntax is unambiguous. This is a key advantage Erlang has over Elixir when it comes to syntax. Variables, functions, and all other data types are easily distinguishable. Keywords can be confused with other atoms but this is seldom a problem in practice. The list of keywords is short and easy to memorize but syntax highlighters highlight them in a specific color making memorization unnecessary most of the time.
Erlang/OTP 25-rc2 is the second release candidate of three before the OTP 25.0 release.
The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to the Erlang Forums or the mailing list erlang-questions@erlang.org.
All artifacts for the release can be downloaded from the Erlang/OTP GitHub release, and you can view the new documentation at https://erlang.org/documentation/doc-13.0-rc2/doc/.
You can also install the latest release using kerl like this: kerl build 25.0-rc2 25.0-rc2.
Erlang/OTP 25 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new features are highlighted below.
Many thanks to all contributors!
Below are some highlights of the release:
New function filelib:ensure_path/1 will ensure that all directories for the given path exist.

New functions groups_from_list/2 and groups_from_list/3 in the maps module.

New functions uniq/1 and uniq/2 in the lists module.

Features as described in EEP-60: features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used with erl for enabling/disabling features allowed at runtime. The new maybe expression (EEP-49) is fully supported as the feature maybe_expr.

Improved support for perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

Users can now configure ETS tables with the {write_concurrency, auto} option. This option makes tables automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active. Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks.

Processes configured with message_queue_data=off_heap have been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.

A new option short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.

New quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.

The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.

global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.

It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.

The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback. The new callback adds the possibility to limit and change many more things than just the state.

The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.

The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used. To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate, and then it will also be possible to enable the feature from inside the module being compiled.

When matching or updating a record fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.

A new -nifs() attribute to empower the compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.

Improvements to erl_error:format_exception/3,4.

New function crypto:hash_equals/2, which is a constant-time comparison of hash values.

Dialyzer optimizations: operations in the erl_types module have been optimized, and the Dialyzer pass remote has been parallelized.

New missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as, overspecs and underspecs.

Improved type analysis of min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.

For more details about new features and potential incompatibilities, see the release notes.