Erlang/OTP 27.0 Release Candidate 1

OTP 27.0-rc1

Erlang/OTP 27.0-rc1 is the first release candidate of three before the OTP 27.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-15.0-rc1/doc. You can also install the latest release using kerl like this:

kerl build 27.0-rc1 27.0-rc1

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Highlights

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

New language features

  • Triple-quoted strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.
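
A minimal sketch of the two features above (hypothetical module, assuming an OTP 27 toolchain):

    -module(strings_demo).
    -export([doc/0, name/0]).

    %% Triple-quoted string (EEP 64): spans several lines; the indentation of
    %% the closing delimiter is stripped from the content lines.
    doc() ->
        """
        This string encompasses
        a complete paragraph.
        """.

    %% Sigil (EEP 66): ~"..." is equivalent to <<"..."/utf8>>.
    name() ->
        ~"Björn".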

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature.

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.
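
As an illustration of the maybe expression mentioned above, here is a minimal sketch (the module name and keys are made up):

    -module(maybe_demo).
    -export([get_pair/1]).

    %% maybe no longer requires enabling the maybe_expr feature in OTP 27.
    get_pair(Map) ->
        maybe
            {ok, A} ?= maps:find(a, Map),
            {ok, B} ?= maps:find(b, Map),
            {ok, {A, B}}
        else
            error -> {error, missing_key}
        end.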

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2.

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.
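
A small sketch combining two of the additions above, proc_lib:set_label/1 and ets:update_element/4 (the table name, label, and record layout are made up for illustration):

    -module(stdlib_demo).
    -export([init/0, set_status/2]).

    init() ->
        %% Give this unregistered process a descriptive label for observer,
        %% c:i/0, and crash reports.
        proc_lib:set_label({status_tracker, node()}),
        ets:new(status_tab, [named_table, public, set]).

    set_status(Key, Status) ->
        %% Set field 2 of the record with key Key; the fourth argument is the
        %% default tuple inserted when no record with that key exists yet.
        ets:update_element(status_tab, Key, {2, Status}, {Key, undefined}).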

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status requests (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.

A Distributed Systems Reading List

2024/02/07

A Distributed Systems Reading List

This document contains various resources and quick definitions of a lot of the background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time had asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.

Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.

It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and possibilities that are around this environment.

Foundational theory

This is information providing the basics of all the distsys theory. Most of the papers or resources you read will make references to some of these concepts, so explaining them makes sense.

Models

In a Nutshell

There are three model types used by computer scientists doing distributed system theory:

  1. synchronous models
  2. semi-synchronous models
  3. asynchronous models

A synchronous model means that each message sent within the system has a known upper bound on communications (the max delay between a message being sent and received) and on the processing speed between nodes or agents. This means that you can know for sure that after a period of time, a message was missed. This model is applicable in rare cases, such as hardware signals, and is mostly beginner mode for distributed system proofs.

An asynchronous model means that you have no upper bound. It is legit for agents and nodes to process and delay things indefinitely. You can never assume that a "lost" message you haven't seen for the last 15 years won't just happen to be delivered tomorrow. The other node can also be stuck in a GC loop that lasts 500 centuries, that's good.

Proving that something works in an asynchronous model means it works with all the other types. This is expert mode for proofs and is, in most cases, even trickier to make work than real-world implementations.

Semi-synchronous models are the cheat mode for the real world. There are upper bounds to the communication mechanisms and nodes everywhere, but they are often configurable and unspecified. This is what lets a protocol designer go "you know what, we're gonna stick a ping message in there, and if you miss too many of them we consider you're dead."

You can't assume all messages are delivered reliably, but you give yourself a chance to say "now that's enough, I won't wait here forever."
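
A tiny sketch of that ping-and-timeout idea in Erlang (the message format here is made up, nothing standardized):

    %% Semi-synchronous failure detection: send a ping and, past a configured
    %% upper bound, decide to treat the peer as suspected-down.
    check_alive(Peer, TimeoutMs) ->
        Peer ! {ping, self()},
        receive
            {pong, Peer} -> alive
        after TimeoutMs ->
            %% Maybe dead, maybe just slow; we simply decide to stop waiting.
            suspected_down
        end.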

Protocols like Raft, Paxos, and ZAB (quorum protocols behind etcd, Chubby, and ZooKeeper respectively) all fit this category.

Theoretical Failure Modes

The way failures happen and are detected is important to a bunch of algorithms. The following are the most commonly used ones:

  1. Fail-stop failures
  2. Crash failures
  3. Omission failures
  4. Performance failures
  5. Byzantine failures

First, Fail-stop failures mean that if a node encounters a problem, everyone can know about it and detect it, and can restore state from stable storage. This is easy mode for theory and protocols, but super hard to achieve in practice (and in some cases impossible).

Crash failures mean that if a node or agent has a problem, it crashes and then never comes back. You are either correct or late forever. This is actually easier to design around than fail-stop in theory (but a huge pain to operate because redundancy is the name of the game, forever).

Omission failures imply that a node either gives correct results that respect the protocol, or never answers.

Performance failures assume that while you respect the protocol in terms of the content of the messages you send, you will also possibly send results late.

Byzantine failures mean that anything can go wrong (including people willingly trying to break your protocol with bad software pretending to be good software). There's a special class of authentication-detectable byzantine failures which at least puts the constraint that you can't forge messages from other nodes, but that is an optional thing. Byzantine modes are the worst.

By default, most distributed system theory assumes that there are no bad actors or agents that are corrupted and willingly trying to break stuff, and byzantine failures are left up to blockchains and some forms of package management.

Most modern papers and stuff will try and stick with either crash or fail-stop failures since they tend to be practical.

See this typical distsys intro slide deck for more details.

Consensus

This is one of the core problems in distributed systems: how can all the nodes or agents in a system agree on one value? The reason it's so important is that if you can agree on just one value, you can then do a lot of things.

The most common example of picking a single very useful value is the name of an elected leader that enforces decisions, just so you can stop having to build more consensuses because holy crap consensuses are painful.

Variations exist on what exactly counts as a consensus, including whether everyone must agree fully (strong) or just a majority (t-resilient), and asking the same question under various synchronicity or failure models.

Note that while classic protocols like Paxos use a leader to ensure consistency and speed up execution while remaining consistent, a bunch of systems will forgo these requirements.

FLP Result

In A Nutshell

Stands for Fischer-Lynch-Paterson, the authors of a 1985 paper that states that proper consensus, where all participants agree on a value, is unsolvable in a purely asynchronous model (even though it is in a synchronous model) as long as any kind of failure is possible, even if they're just delays.

It's one of the most influential papers in the arena because it triggered a lot of other work for other academics to define what exactly is going on in distributed systems.

Detailed reading

Fault Detection

Following FLP results, which showed that failure detection was kind of super-critical to making things work, a lot of computer science folks started working on what exactly it means to detect failures.

This stuff is hard and often much less impressive than we'd hope for it to be. There are strong and weak fault detectors. The former implies all faulty processes are eventually identified by all non-faulty ones, and the latter that only some non-faulty processes find out about faulty ones.

Then there are degrees of accuracy:

  1. Nobody who has not crashed is suspected of being crashed
  2. It's possible that a non-faulty process is never suspected at all
  3. You can be confused because there's chaos but at some point non-faulty processes stop being suspected of being bad
  4. At some point there's at least one non-faulty process that is not suspected

You can possibly realize that a strong and fully accurate detector (said to be perfect) kind of implies that you get a consensus, and since consensus is not really doable in a fully asynchronous system model with failures, then there are hard limits to things you can detect reliably.

This is often why semi-synchronous system models make sense: if you treat delays greater than T to be a failure, then you can start doing adequate failure detection.

See this slide deck for a decent intro

CAP Theorem

The CAP theorem was for a long while just a conjecture, but it was proven in the early 2000s, leading to a lot of eventually consistent databases.

In A Nutshell

There are three properties to a distributed system:

  • Consistency: any time you write to a system and read back from it, you get the value you wrote or a fresher one back.
  • Availability: every request results in a response (including both reads and writes)
  • Partition tolerance: the network can lose messages

In theory, you can get a system that is both available and consistent, but only under synchronous models on a perfect network. Those don't really exist so in practice P is always there.

What the CAP theorem states is essentially that given P, you have to choose either A (keep accepting writes and potentially corrupt data) or C (stop accepting writes to save the data, and go down).

Refinements

CAP is a bit strict in what you get in practice. Not all partitions in a network are equivalent, and not all consistency levels are the same.

Two of the most common approaches to add some flexibility to the CAP theorem are the Yield/Harvest models and PACELC.

Yield/Harvest essentially says that you can think of the system differently: yield is your ability to fulfill requests (as in uptime), and harvest is the fraction of all the potential data you can actually return. Search engines are a common example here, where they will increase their yield and answer more often by reducing their harvest when they ignore some search results to respond faster if at all.

PACELC adds the idea that eventually-consistent databases are overly strict. In case of network Partitioning you have to choose between Availability and Consistency, but Else, when the system is running normally, one has to choose between Latency and Consistency. The idea is that you can decide to degrade your consistency for availability (but only when you really need to), or you could decide to always forego consistency because you gotta go fast.

It is important to note that you cannot beat the CAP theorem (as long as you respect the models under which it was proven), and anyone claiming to do so is often a snake oil salesman.

Resources

There have been countless rehashes of the CAP theorem and various discussions over the years; the results are mathematically proven even if many keep trying to make the argument that they're so reliable it doesn't matter.

Message Passing Definitions

Messages can be sent zero or more times, in various orderings. Some terms are introduced to define what they are:

  • unicast means that the message is sent to one entity only
  • anycast means that the message is sent to any valid entity
  • broadcast means that a message is sent to all valid entities
  • atomic broadcast or total order broadcast means that all the non-faulty actors in a system receive the same messages in the same order, whichever that order is
  • gossip stands for the family of protocols where messages are forwarded between peers with the hope that eventually everyone gets all the messages
  • at least once delivery means that each message will be sent once or more; listeners are to expect to see all messages, but possibly more than once
  • at most once delivery means that each sender will only send the message one time. It's possible that listeners never see it.
  • exactly once delivery means that each message is guaranteed to be sent and seen only once. This is a nice theoretical objective but quite impossible in real systems. It ends up being simulated through other means (combining atomic broadcast with specific flags and ordering guarantees, for example)
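
As a small sketch of the at least once delivery item above (hypothetical message format): keep resending until an acknowledgement arrives, which is exactly why the receiver may see duplicates.

    send_at_least_once(Dest, Ref, Msg) ->
        Dest ! {msg, self(), Ref, Msg},
        receive
            {ack, Ref} -> ok
        after 1000 ->
            %% No ack within a second: resend. The receiver must tolerate
            %% seeing the same Ref more than once.
            send_at_least_once(Dest, Ref, Msg)
        end.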

Regarding ordering:

  • total order means that all messages have just one strict ordering and way to compare them, much like 3 is always greater than 2.
  • partial order means that some messages can compare with some messages, but not necessarily all of them. For example, I could decide that all the updates to the key k1 can be in a total order regarding each other, but independent from updates to the key k2. There is therefore a partial order between all updates across all keys, since k1 updates bear no information relative to the k2 updates.
  • causal order means that all messages that depend on other messages are received after these (you can't learn of a user's avatar before you learn about that user). It is a specific type of partial order.

There isn't a "best" ordering, each provides different possibilities and comes with different costs, optimizations, and related failure modes.

Idempotence

Idempotence is important enough to warrant its own entry. Idempotence means that when messages are seen more than once, resent or replayed, they don't impact the system differently than if they were sent just once.

Common strategies include having each message refer to previously seen messages, so that you define an ordering that will prevent replaying older messages, or setting unique IDs (such as transaction IDs) coupled with a store that will prevent replaying transactions, and so on.
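
One of those strategies sketched out in Erlang, using unique IDs plus a store of already-seen IDs (apply_op/2 is a hypothetical stand-in for whatever the operation actually does):

    %% Replayed messages with an already-seen ID leave the state untouched.
    handle({Id, Op}, #{seen := Seen, value := Value} = State) ->
        case sets:is_element(Id, Seen) of
            true ->
                State;  %% duplicate: no effect
            false ->
                State#{seen := sets:add_element(Id, Seen),
                       value := apply_op(Op, Value)}
        end.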

See Idempotence is not a medical condition for a great read on it, with various related strategies.

State Machine Replication

This is a theoretical model by which, given the same sequences of states and the same operations applied to them (disregarding all kinds of non-determinism), all state machines will end up with the same result.

This model ends up being critical to most reliable systems out there, which tend to all try to replay all events to all subsystems in the same order, ensuring predictable data sets in all places.

This is generally done by picking a leader; all writes are done through the leader, and all the followers get a consistent replicated state of the system, allowing them to eventually become leaders or to fan-out their state to other actors.
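
A toy version of the idea (the map operations are made up): replicas that start from the same state and replay the same log of operations in the same order end up with the same result.

    apply_op({put, Key, Val}, State) -> maps:put(Key, Val, State);
    apply_op({del, Key}, State)      -> maps:remove(Key, State).

    %% Replaying the same ordered log on any replica yields the same state.
    replay(Log) ->
        lists:foldl(fun apply_op/2, #{}, Log).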

State-Based Replication

State-based replication can be conceptually simpler than state-machine replication, with the idea that if you only replicate the state, you get the state at the end!

The problem is that it is extremely hard to make this fast and efficient. If your state is terabytes large, you don't want to re-send it on every operation. Common approaches will include splitting, hashing, and bucketing of data to detect changes and only send the changed bits (think of rsync), merkle trees to detect changes, or the idea of a patch to source code.

Practical Matters

Here are a bunch of resources worth digging into for various system design elements.

End-to-End Argument in System Design

Foundational practical aspect of system design for distributed systems:

  • a message that is sent is not a message that is necessarily received by the other party
  • a message that is received by the other party is not necessarily a message that is actually read by the other party
  • a message that is read by the other party is not necessarily a message that has been acted on by the other party

The conclusion is that if you want anything to be reliable, you need an end-to-end acknowledgement, usually written by the application layer.

These ideas are behind the design of TCP as a protocol, but the authors also note that it wouldn't be sufficient to leave it at the protocol, the application layer must be involved.

Fallacies of Distributed Computing

The fallacies are:

  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • The network is secure
  • Topology doesn't change
  • There is one administrator
  • Transport cost is zero
  • The network is homogeneous

Partial explanations on the Wiki page or full ones in the paper.

Common Practical Failure Modes

In practice, when you switch from Computer Science to Engineering, the types of faults you will find are a bit more varied, but can map to any of the theoretical models.

This section is an informal list of common sources of issues in a system. See also the CAP theorem checklist for other common cases.

Netsplit

Some nodes can talk to each other, but some nodes are unreachable to others. A common example is that a US-based network can communicate fine internally, and so could an EU-based network, but both would be unable to speak to each other.

Asymmetric Netsplit

Communication between groups of nodes is not symmetric. For example, imagine that the US network can send messages to the EU network, but the EU network cannot respond back.

This is a rarer mode when using TCP (although it has happened before), and a potentially common one when using UDP.

Split Brain

The way a lot of systems deal with failures is to keep a majority going. A split brain is what happens when both sides of a netsplit think they are the leader, and start making conflicting decisions.

Timeouts

Timeouts are particularly tricky because they are non-deterministic. They can only be observed from one end, and you never know if a timeout that is ultimately interpreted as a failure was actually a failure, or just a delay due to networking, hardware, or GC pauses.

There are times where retransmissions are not safe if the message has already been seen (i.e. it is not idempotent), and timeouts essentially make it impossible to know if retransmission is safe to try: was the message acted on, dropped, or is it still in transit or in a buffer somewhere?

Missing Messages due to Ordering

Generally, using TCP and crashes will tend to mean that few messages get missed across systems, but frequent cases can include:

  • The node has gone down (or the software crashed) for a few seconds during which it missed a message that won't be repeated
  • The updates are received transitively across various nodes. For example, a message published by service A on a bus (whether Kafka or RMQ) can end up read, transformed or acted on and re-published by service B, and there is a possibility that service C will read B's update before A's, causing issues in causality

Clock Drift

Not all clocks on all systems are synchronized properly (even using NTP) and will go at different speeds.

Using a timestamp to sort through events is almost guaranteed to be a source of bugs, even more so if the timestamps come from multiple computers.

The Client is Part of the System

A very common pitfall is to forget that the client that participates in a distributed system is part of it. Consistency on the server-side will not necessarily be worth much if the client can't make sense of the events or data it receives.

This is particularly insidious for database clients that run a non-idempotent transaction, time out, and have no way to know if they can try it again.

Restoring from multiple backups

A single backup is kind of easy to handle. Multiple backups run into a problem called consistent cuts (high level view) and distributed snapshots, which means that not all the backups are taken at the same time, and this introduces inconsistencies that can be construed as corrupting data.

The good news is there's no great solution and everyone suffers the same.

Consistency Models

There are dozens of different levels of consistency, all of which are documented on Wikipedia, in Peter Bailis' paper on the topic, or overviewed in Kyle Kingsbury's post on them.

  • Linearizability means each operation appears atomic and could not have been impacted by another one, as if they all ran just one at a time. The order is known and deterministic, and a read that started after a given write had started will be able to see that data.
  • Serializability means that while all operations appear to be atomic, it makes no guarantee about which order they would have happened in. It means that some operations might start after another one and complete before it, and as long as the isolation is well-maintained, that isn't a problem.
  • Sequential consistency means that even if operations might have taken place out-of-order, they will appear as if they all happened in order.
  • Causal Consistency means that only operations that have a logical dependency on each other need to be ordered amongst each other.
  • Read-committed consistency means that any operation that has been committed is available for further reads in the system.
  • Repeatable reads means that within a transaction, reading the same value multiple times always yields the same result.
  • Read-your-writes consistency means that any write you have completed must be readable by the same client subsequently.
  • Eventual Consistency is a kind of special family of consistency measures that say that the system can be inconsistent as long as it eventually becomes consistent again. Causal consistency is an example of eventual consistency.
  • Strong Eventual Consistency is like eventual consistency but demands that no conflicts can happen between concurrent updates. This is usually the land of CRDTs.

Note that while these definitions have clear semantics that academics tend to respect, they are not adopted uniformly or respected in various projects' or vendors' documentation in the industry.

Database Transaction Scopes

By default, most people assume database transactions are linearizable, and they tend not to be because that's way too slow as a default.

Each database might have different semantics, so the following links may cover the most major ones.

Be aware that while the PostgreSQL documentation is likely the clearest and easiest to understand introduction to the topic, various vendors can assign different meanings to the same standard transaction scopes.

Logical Clocks

Those are data structures that let you create either total or partial orderings between messages or state transitions.

Most common ones are:

  • Lamport timestamps, which are just a counter. They allow the silent undetected crushing of conflicting data
  • Vector Clocks, which contain a counter per node, incremented on each message seen. They can detect conflicts in data and on operations.
  • Version Vectors are like vector clocks, but only change the counters on state variations rather than on every event seen.
  • Dotted Version Vectors are fancy version vectors that allow tracking conflicts that would be perceived by the client talking to a server.
  • Interval Tree Clocks attempt to fix the issues of other clock types by requiring less space to store node-specific information and by allowing a kind of built-in garbage collection. They also have one of the nicest papers ever.
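
For instance, a Lamport timestamp (the first entry above) is nothing more than a counter with two rules, sketched here:

    %% Increment the local clock on every local event or message sent.
    tick(Clock) ->
        Clock + 1.

    %% On receiving a message stamped with RemoteClock, jump past both clocks.
    merge(LocalClock, RemoteClock) ->
        max(LocalClock, RemoteClock) + 1.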

CRDTs

CRDTs essentially are data structures that restrict operations that can be done such that they can never conflict, no matter which order they are done in or how concurrently this takes place.

Think of it as the specification of how someone would write a distributed Redis that was never wrong, but left only the maths behind.

This is still an active area of research and countless papers and variations are always coming out.
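
As a small example of the kind of structure involved, here is a grow-only counter (G-Counter), one of the simplest CRDTs: each node only increments its own entry, and merging takes the per-node maximum, so concurrent updates can never conflict no matter the order in which they arrive.

    %% State: a map of Node => count for that node.
    increment(Node, Counter) ->
        maps:update_with(Node, fun(N) -> N + 1 end, 1, Counter).

    %% Merging two replicas keeps the highest count seen for each node.
    merge(A, B) ->
        maps:merge_with(fun(_Node, X, Y) -> max(X, Y) end, A, B).

    %% The counter's value is the sum of all per-node counts.
    value(Counter) ->
        lists:sum(maps:values(Counter)).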

Other interesting material

The bible for putting all of these views together is Designing Data-Intensive Applications by Martin Kleppmann. Be advised however that everyone I know who absolutely loves this book already had a good foundation in distributed systems from reading a bunch of papers, and greatly appreciated having it all put in one place. Most people I've seen read it in book clubs with the aim of getting better at distributed systems still found it challenging and confusing at times, and benefitted from having someone around to whom they could ask questions in order to bridge some gaps. It is still the clearest source I can imagine for everything in one place.

Counting Forest Fires

2024/01/26

Counting Forest Fires

Today I'm hitting my 3-year mark at Honeycomb, and so I thought I'd re-publish one of my favorite short blog posts written over there, Counting Forest Fires, which has become my go-to argument when discussing incident metrics and being asked to count outages and incidents.

If you were asked to evaluate how good crews were at fighting forest fires, what metric would you use? Would you consider it a regression on your firefighters' part if you had more fires this year than the last? Would the size and impact of a forest fire be a measure of their success? Would you look for the cause—such as a person lighting it, an environmental factor, etc—and act on it? Chances are that yes, that's what you'd do.

As time has gone by, we've learned interesting things about forest fires. Smokey Bear can tell us to be careful all he wants, but sometimes there's nothing we can do about fires. Climate change creates conditions where fires are going to be more likely, intense, and uncontrollable. We constantly learn from indigenous approaches to fire management, and we now leverage prescribed burns instead of trying to prevent them all.

In short, there are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know the burden or impact they have, it isn't a legitimate measure of success. Knowing whether your firefighters or your prevention campaigns are useful can't rely on these high-level observations, because they'll be drowned in the noise of a messy, unpredictable world.

Forest fires and tech fires: turns out they're not so different

The parallel to software is obvious: there are conditions we put in place in organizations—or the whole industry—and things we do that have greater impacts than we can account for, or that can't be countered by individual teams or practitioner actions. And if you want to improve things, there are things you can measure that are going to be more useful. The type of metric I constantly argue for is to count the things you can do, not the things you hope don't happen. You hope that forest fires don't happen, but there's only so much that prevention can do. Likewise with incidents. You want to know that your response is adequate.

You want to know about risk factors to alter behavior in the immediate, or of signals that tell you the way you run things needs to be adjusted in the long-term, for sustainability. The goal isn't to prevent all incidents, because small ones here and there are useful to prevent even greater ones or to provide practice to your teams, enough that you may want to cause controlled incidents on purpose—that's chaos engineering. You want to be able to prioritize concurrent events so that you respond where it's most worth it. You want to prevent harm to your responders, and know how to limit it as much as possible on the people they serve. You want to make sure you learn from new observations and methods, and that your practice remains current with escalating challenges.

Don't look at success/failure. It goes deeper than that.

The concepts mentioned above are things you can invest in, train people for, and create conditions that can lead to success. Counting forest fires or incidents lets you estimate how bad a given season or quarter was, but it tells you almost nothing about how good of a job you did.

It tells you about challenges, about areas where you may want to invest and pay attention, and where you may need to repair and heal—more than it tells you about successes or failures. Fires are going to happen regardless, but they can be part of a healthy ecosystem. Trying to stamp them all out may do more harm than good in the long run.

The real question is: how do you want to react? And how do you measure that?

Elixir v1.16 released

Elixir v1.16 has just been released. 🎉

The Elixir team continues improving the developer experience via tooling, documentation, and precise feedback, while keeping the language stable and compatible.

The notable improvements in this release are the addition of compiler diagnostics and extensive improvements to our docs in the forms of guides, anti-patterns, diagrams and more.

Code snippets in diagnostics

Elixir v1.15 introduced a new compiler diagnostic format and the ability to print multiple error diagnostics per compilation (in addition to multiple warnings).

With Elixir v1.16, we also include code snippets in exceptions and diagnostics raised by the compiler, including ANSI coloring on supported terminals. For example, a syntax error now includes a pointer to where the error happened:

** (SyntaxError) invalid syntax found on lib/my_app.ex:1:17:
    error: syntax error before: '*'
    │
  1 │ [1, 2, 3, 4, 5, *]
    │                 ^
    │
    └─ lib/my_app.ex:1:17

For mismatched delimiters, it now shows both delimiters:

** (MismatchedDelimiterError) mismatched delimiter found on lib/my_app.ex:1:18:
    error: unexpected token: )
    │
  1 │ [1, 2, 3, 4, 5, 6)
    │ │                └ mismatched closing delimiter (expected "]")
    │ └ unclosed delimiter
    │
    └─ lib/my_app.ex:1:18

For unclosed delimiters, it now shows where the unclosed delimiter starts:

** (TokenMissingError) token missing on lib/my_app:8:23:
    error: missing terminator: )
    │
  1 │ my_numbers = (1, 2, 3, 4, 5, 6
    │              └ unclosed delimiter
 ...
  8 │ IO.inspect(my_numbers)
    │                       └ missing closing delimiter (expected ")")
    │
    └─ lib/my_app:8:23

Errors and warnings diagnostics also include code snippets. When possible, we will show precise spans, such as on undefined variables:

  error: undefined variable "unknown_var"
  │
5 │     a - unknown_var
  │         ^^^^^^^^^^^
  │
  └─ lib/sample.ex:5:9: Sample.foo/1

Otherwise the whole line is underlined:

error: function names should start with lowercase characters or underscore, invalid name CamelCase
  │
3 │   def CamelCase do
  │   ^^^^^^^^^^^^^^^^
  │
  └─ lib/sample.ex:3

A huge thank you to Vinícius Müller for working on the new diagnostics.

Revamped documentation

The ExDoc package provides Elixir developers with one of the most complete and robust documentation generators. It supports API references, tutorials, cheatsheets, and more.

However, because many of the language tutorials and reference documentation were written before ExDoc, they were maintained as part of the official website, separate from the language source code. With Elixir v1.16, we have moved our learning material to the language repository. This provides several benefits:

  1. Tutorials are versioned alongside their relevant Elixir version

  2. You get full-text search across all API reference and tutorials

  3. ExDoc will autolink module and function names in tutorials to their relevant API documentation

Another feature we have incorporated in this release is the addition of cheatsheets, starting with a cheatsheet for the Enum module. If you would like to contribute future cheatsheets to Elixir itself, feel free to start a discussion and collect feedback on the Elixir Forum.

Finally, we have started enriching our documentation with Mermaid.js diagrams. You can find examples in the GenServer and Supervisor docs.

Elixir has always been praised by its excellent documentation and we are glad to continue to raise the bar for the whole ecosystem.

Living anti-patterns reference

Elixir v1.16 incorporates and extends the work on Understanding Code Smells in Elixir Functional Language, by Lucas Vegi and Marco Tulio Valente, from ASERG/DCC/UFMG, into the official documentation in the form of anti-patterns. Our goal is to provide examples of potential pitfalls for library and application developers, with additional context and guidance on how to improve their codebases.

In earlier versions, Elixir’s official reference for library authors included a list of anti-patterns for library developers. Lucas Vegi and Marco Tulio Valente extended and refined this list based on the existing literature, articles, and community input (including feedback based on their prevalence in actual codebases).

To incorporate the anti-patterns into the language, we trimmed the list down to keep only anti-patterns which are unambiguous and actionable, and divided them into four categories: code-related, design-related, process-related, and meta-programming. Then we collected more community feedback during the release candidate period, further refining and removing unclear guidance.

We are quite happy with the current iteration of anti-patterns but this is just the beginning. As they become available to the whole community, we expect to receive more input, questions, and concerns. We will continue listening and improving, as our ultimate goal is to provide a live reference that reflects the practices of the ecosystem, rather than a document that is written in stone and ultimately gets out of date. A perfect example of this is the recent addition of “Sending unnecessary data” anti-pattern, which was contributed by the community and describes a pitfall that may happen across codebases.

Type system updates

As we get Elixir v1.16 out the door, the Elixir team will focus on bringing the initial core for set-theoretic types into the Elixir compiler, with the goal of running automated analysis in patterns and guards. This is the first step outlined in a previous article and is sponsored by Fresha (they are hiring!), Starfish* (they are hiring!), and Dashbit.

Learn more

Other notable changes in this release are:

  • the addition of String.replace_invalid/2, to help deal with invalid UTF-8 encoding

  • the addition of the :limit option in Task.yield_many/2 that limits the maximum number of tasks to yield

  • improved binary pattern matching by allowing prefix binary matches, such as <<^prefix::binary, rest::binary>>

For a complete list of all changes, see the full release notes.

Check the Install section to get Elixir installed and read our Getting Started guide to learn more.

Happy learning!

Negotiable Abstractions

2023/12/21

Negotiable Abstractions

When I used to write more software and do more architecture professionally (I still do, but less intensively so with the SRE title), one of the most important questions I'd seek answers to was "how do I cut this up?" Or more accurately, "how do I cut this up without it coming back to haunt me later?" My favorite guideline was to write code that is easy to delete, not code that is easy to maintain; Tef wrote about this more eloquently than I ever did. I long ago gave up on the idea of writing maintainable code rather than code that was articulated in the right places to make things easy to get rid of.

It's the kind of thing that made sense to me and with which I found more success, but without necessarily having a good underlying principle for it. As I've been slowly writing code for fun in my toy project doing file synchronization, I had the benefit of going slow, thinking longer, and revisiting decisions with rest and no production pressure. A recent change in there, along with tons of reading in the last few years sort of crystallized what feels like a good explanation of it.

In this post I'll cover what theoretically (and often practically) good abstractions require. I'll then use the changes I brought to my toy project to show how factors entirely unrelated to code can drive change and define abstractions, and how these changes in turn can end up re-defining the context in which the code runs, which in turn prompts re-framing the software itself. This ultimately makes me believe abstractions are contextual, subjective, and therefore negotiable.

Theoretically good abstractions

One of my favorite sources for newer engineers trying to understand how to structure their software is John Ousterhout's A Philosophy of Software Design—I used it to start multiple book clubs in multiple organizations—which I wish had existed when I was newer to this industry.

There are quite a few heuristics in there to judge what makes a good abstraction:

  • Modular design is an ideal, which is hard to attain because modules must know about each other, and dependencies force them to change together; modular design minimizes dependencies.
  • Dividing interfaces and implementations is how we cut this up, splitting up what and how concerns.
  • Abstractions fundamentally omit unimportant details, which let us reason about things more simply. This word, unimportant, will be critical for us here. Abstractions that include unimportant details are more complicated than necessary. Omitting important details will cause confusion.
  • As a heuristic, the best abstractions tend to be deep: they have a narrow interface that lets you access a lot of functionality. Shallow modules, which have a wide interface and limited functionality, tend to be poor abstractions. File I/O as offered by Linux is mentioned as a good, deep interface.
  • A good amount of care is taken to define information hiding and leaking; information leaks when multiple modules are impacted by a design decision.
  • It is more important for a module to have a simple interface than a simple implementation

The book contains many examples and tips on what makes abstractions good or bad, and how to recognize problematic ones. These rules of thumb are all solid and hard to disagree with—many of them put into words things I had been doing for a while when I first read it years ago.

Whether it's tacit or not, these ideas of carefully choosing where complexity should live influence how I approach laying out systems, modules, and code.

Context Drives Change

There are important things that do not show up in code, however, and can come from elements far outside of it. To demonstrate that, I want to describe a recent change I made in ReVault, my toy project that does file synchronization.

To give you some context, ReVault works by scanning directories, getting the hash of files, tracking changes with interval tree clocks, and then comparing manifests to peers to know what needs to be synchronized and whether any concurrent changes happened that should be considered a conflict. Pretty much everything it does deals with files. And so while it was a bit tricky to choose how to cut up the various scanning, networking, and synchronization modules around both code and state boundaries, it was rather straightforward for file handling.

I leaned heavily on the POSIX interface offered by your usual standard library, with functions wrapping higher-level concepts like "serializing data and then overwriting a file safely," while mostly sticking to direct file usage otherwise. The file abstractions were tried and true, and they lined up nicely with all the basic operations I had to deal with. If I had to frame it according to A Philosophy of Software Design's principles, I avoided creating a lot of shallow modules and thin wrappers that provided little in terms of hiding complexity.

Earlier this year, I decided to experiment with an S3-based back-end, which would let me make an off-site copy of some files in a private bucket without needing to secure a VPS or server to make it safe to store data there. I expected this to be a rather straightforward mapping of calls from one interface to another. Sure, S3 is object storage, not file storage, but the API to it is very file-like so most of it ought to be compatible.

And it was, for the most part. The biggest difference for ReVault was that AWS charges per operation. If I left the API the same and just did a one-to-one substitution, all the scans and hash checks on a single synchronization of a moderately small directory could cost me 10 cents each time. This felt like a losing proposition; small frequent changes give the best expected result to reduce the probability of file conflicts, and incentivizing the opposite behavior by charging for empty syncs is ineffective design.

The obvious workarounds include tweaking the software's state machine to scan less if nothing changed, and caching the hashes when using S3 to avoid re-fetching them if the files haven't changed since the last run. Because of how S3 handles listing files, last modified timestamps, and hashes, the cache-based approach could be 20 times cheaper than the naive one.

This cache, however, could only be reliable and only made sense for S3, not the file system back-end. It also required refactoring the interface to the storage layer: it had to be shifted higher up into the application logic, in order to properly hide caching concerns, compatibility with disk storage, and the ways it would intertwine with the state machine.

None of the business rules changed, none of the old file operations or concepts changed. To refer back to Ousterhout's book, nothing important or unimportant changed in what we expose in the interface. What changed is a cost structure, which is connected to contextual priorities (I don't want to pay much for my synchronization) in a way that the code has absolutely no concept of before or after the change.

This is an interesting property to highlight here: what makes a good or bad abstraction may have nothing to do with what is objectively observable in code (which doesn't mean it never does, far from it), and a lot to do with the desires of people around the system.

Change Creates New Contexts

It does not end there though. Somehow, making that well-encapsulated change with a brand new interface messed with other parts of the system as well. And not just in small ways: when I synchronized larger directories, the software would eventually freeze. There was no error, but it would start using so much memory that even the REPL I used to interact with the s3-side local node would freeze entirely and I'd need to hard kill the process: no way to debug it interactively.

What is it about the new file layer that was buggy? As it turns out, nothing. A bit of analysis showed that the S3 layer was perfectly functional, and so was the old disk layer. The new abstraction was fine.

The problem was that because I no longer needed a disk for one of my two peers, I started running both of them on the same computer: one using the disk, and one using S3. That shifted bottlenecks around:

An architecture diagram showing the first host with its disk and Erlang node, communicating to a second identical host over a public network, which is also marked as a bottleneck.

An architecture diagram showing a single host, on which two Erlang virtual machines are running. The first one talks to the local disk, and the second one to S3 over a public network. Both Erlang VMs talk together via the local network within a single host. The public network between the second VM and S3 is marked as a bottleneck.

This bottleneck was problematic because there existed one little point of asynchronous communications in the whole pipeline:

An architecture diagram showing a single host, on which two Erlang virtual machines are running. The first VM shows two internal processes: a FSM that talks to the local disk, and a TLS server that is in-between the FSM and the private network on the host. The second VM has a TLS process connected to the TLS process from the first VM, which sends data to its own FSM within this VM. That last FSM talks to S3 over the public network. All communication hops are marked as synchronous, aside from the TLS-to-FSM link on the second VM (the one that talks to S3)

This meant that reading from disk, going over the local network, and pushing data to the remote state machine went unhindered while memory kept accumulating in front of the slower S3 upload, until the node became unresponsive. Making that hop between the second VM's TLS and FSM processes synchronous fixed things.
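
In generic OTP terms (this is a sketch, not ReVault's actual code), the fix amounts to replacing the one fire-and-forget cast with a call, so the fast side blocks on the slow side instead of letting messages pile up:

    %% Before: asynchronous hand-off; the TLS process can outrun the FSM and
    %% let its mailbox (and memory use) grow without bound.
    forward_async(Fsm, Data) ->
        gen_statem:cast(Fsm, {data, Data}).

    %% After: synchronous hand-off; the caller waits until the FSM has taken
    %% the data, which propagates backpressure back towards the reader.
    forward_sync(Fsm, Data) ->
        gen_statem:call(Fsm, {data, Data}, infinity).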

Putting it all together, a change in interface to keep costs low on an otherwise equivalent feature set has an unexpected interaction with how I deploy software that shifts bottlenecks such that I need to adjust network-related communications in an entirely distinct component. Not so self-contained anymore. Even more than that, this whole "bottleneck-centric analysis" was not necessary (even if I was familiar with the concepts and had thought about it a bit) until I started accidentally moving it around and it suddenly gained importance.

Basically, in a world where you have imperfect knowledge that improves over time, the order in which you stumble upon discoveries shapes the system you build, which in turn is impacted in potentially surprising ways by seemingly unrelated changes.

New Contexts Mean New Boundaries

The key point I would make here is that what is important or not—one of the key factors in defining proper abstractions—is not an objective fact encoded into the world. It is a consequence of how we pick and choose boundaries based on anticipated use cases and a limited understanding of the current world, and that is allowed to change.

We like to think of our systems as inherently possible to analyze: we take the current implementation, stakeholders, their needs, and their experience, and if we study it long enough, we will have the ability to generate insights out of it and then create a better, more adequate system. The challenge here is that the rate of change in the world and for our experiences is out of our control. It is impossible for us to know what we will discover only later. Insights are often quite unpredictable. Yet, all of these can have a fundamental impact on how we judge the adequacy of abstractions and established structures.

Even though it is absolutely involved with code, things are important beyond the code, and therefore the abstractions within the code are bound to change due to factors entirely external and unavoidable to it.

But even then, there's something more general here. Every time we define a category, that we decide to re-arrange a messy complex world into neat boxes, we pick a subjective point of reference that defines the ideal case for that label. If I declare the colors "red" and "yellow", there's a gradient where something may be redder and something else yellowish, but there's also a kind of mess around the cases that would fall into the yet-undefined "orange". Whether we need it or not, and what it means to all other color definitions may practically depend on how much we encounter these cases, and how important they are.

Legacy software systems are defined in dozens of ways. My favorite one is probably just "a system that we no longer know how to or have the will to maintain." Another one here might be: a system whose fundamental abstractions are deeply rooted in a context that has changed so much it lost relevance. This may funnily line up with the theory that software engineers have to ditch older systems to truncate their history for code to keep acting as a commodity.

The heuristics given by Ousterhout are still good. I'd like to think they best fit this analytic pattern, the one of a frozen snapshot you reason about. These bouts of analysis must share room and be intertwined with sensemaking, and that can flip things upside down.

If our abstractions are subjective and contextual, then we ought to consider them negotiable. We can decide what makes a better or a worse abstraction based on what we deem important today, even if it's not in the code at all. But like all negotiations, we don't necessarily get to control when or how they happen.

Thanks to Amos King and Clark Lindsay for reviewing this text.

Erlang/OTP 26.2 Release

OTP 26.2

Erlang/OTP 26.2 is the second maintenance patch package for OTP 26, with mostly bug fixes as well as improvements.

Highlights

  • process_info/2 now supports lookup of values for specific keys in the process dictionary.

Potential incompatibilities:

  • common_test now returns an error when a suite with a badly defined group is executed.

For details about bugfixes and potential incompatibilities see the Erlang 26.2 README

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here
