Distributed Mutability Without Link Rot

The web suffers from link rot: hyperlinks that were valid upon creation will often point into nothingness after some time has passed. Within the framework of the web, the typical solution is to link to a snapshot of the original page, hosted on archive.org.

This is not a true solution, however, since a link to the internet archive can rot just like any other link. The archivists best intentions do not protect against targeted attacks to bring down their servers.

The only foolproof solution is to forego hyperlinks to other servers, and to instead copy11I deliberately ignore dynamically generated content. I wish to discuss interlinked data evolving over time, not computation-as-a-service. their data onto your own server. The obvious inefficiencies — especially once everyone does this and copying some site would involve recursively copying everything that it had previously copied — can be neatly solved by employing content-addressing: you link to some data by giving its secure hash, and the software transparently fetches, caches, and deduplicates the data. This is essentially the pitch of the IPFS.

But if sufficiently efficient verbatim inclusion of data became the core mechanism for hyperlinking, something would be lost. When I link to a friend’s website, I know that readers will always get the up-to-date version. With a snapshot, the majority of readers would get a stale and incomplete version.

Personal websites, blogs and microblogs, forums, and the connections between all of these — indeed most facets of the web that make it feel worthwhile, alive, and human to me — rely on content changing over time. Whenever a project promises to deliver “the immutable web”, I internally replace those words with “censorship-resistant collections of data sets”. Those are worthwhile engineering feats in their own, but they serve different needs than the web.

A quick look at the current landscape of popular distributed systems can make this look like a dilemma: systems which introduce mutability seem to invariably be plagued by link rot. It can feel like either you embrace active mutability at the cost of passive link rot,Active mutation might invalidate links when the author deliberately removes a page, but I do not consider this link rot. Link rot stems from inaction, not from delibarately exercised agency. The latter I consider a feature, not a bug. or you eliminate passive link rot at the cost of active mutability.

That would be a wrong conclusion, there do exist well-known techniques that allow for productive mutability without suffering from link rot. In this post, I want to provide a brief overview of the problem space, to serve as a foundation for future system designs.

Names and (Im)mutability

Data is, by definition, immutable. You cannot change the string "Hello World!", just like you cannot change the number 17. Neither can you change the source code of this website, it just is. When we talk about mutability in computer science, we mean something else: we map some identifier (a name) to some data, and later, we can map the same identifier to different data. The variable my_favourite_string might first be mapped to "Hello World!", and later to "I love you". On the web, URLs take on the role of these names.

In programming, the simplemost case of names is not used for mutability, but merely to make working with large values (or the results of complex computations) easier. Many programming languages offer deliberately immutable bindings from names to values: you give a name to some expression, and you are not allowed to reassign that name to any other expression. As a consequence, you can now use the name and the expression interchangeably22Repeatedly replacing all occurrences of a name with the data to which it was bound is the core operation of the lambda calculus, one of the oldest formal models of computation.. There are computational and notational benefits compared to copy-pasting the same complex expressions over and over: it suffices to evaluate the expression only once and to then reuse the result; and programs become shorter. But semantically, there is no difference.

Content-addressing implements an analogous concept: referencing some data by its name (i.e., its secure hash) is convenient and efficient, but otherwise semantically equivalent to supplying a verbatim copy of the referenced data. The notion of deleting data that somebody else is referencing makes little sense; if the other person had copied your data verbatim into their own data, you would not expect to be able to selectively delete those sections either. Another obvious technical limitation is the impossibility of cyclic references33A challenge to the programming language theory enthusiasts: it seems fair to draw a parallel between cyclic immutable references and mutually recursive functions. The lambda calculus does not look like it would support recursion, let alone mutual recursion, but along come fixpoint combinators and suddenly you get surprisingly close. What is an analogous concept for cyclic references in content-addressed hypertext systems?.

Names on the Web

The hyperlinks of the web, in contrast, implement mutable mappings — clicking on the same link on different days can yield different results. The process from clicking a link to obtaining the data is quite involved, but for this discussion we can simplify it down to two separate steps: first, a URL is resolved into an IP address, and second, the server running at that address is contacted and asked for its data.

Translation of a URL into an IP address — performed by the Domain Name System (DNS) — primarily44Technically, the DNS can also be used to implement mutability, by over time mapping the same URL to different IP addresses serving different content. Using this mechanism for updating content is so unidiomatic that I will disregard it.
The DNS can also be used for optimisations like mapping the same URL to different IP addresses serving identical data, based on physical proximity to the requestor. These optimisations do not meaningfully change the conceptual model, so I abstract over them. serves to make the names of the web human-friendly. Working with 2001:df2:e500:ed1a::1 is a lot less convenient than working with wikipedia.org.

Unfortunately, human-readable names are a scarce and quite valuable resource. The DNS rents out names on a per-year basis. If you stop paying, all links to your site will stop working. This all but guarantees link rot! From this, we can derive several necessary conditions for avoiding link rot:Petname systems make for an example of how to introduce human-friendly names without inducing link rot. A petname, to sketch a rough definition, is any name that a user actively assigns for their own use — think the name you give to a contact on your phone. Petnames are inherently subjective, though their assignment could in principle be delegated instead of being carried out manually.
See also: Zooko’s Triangle.

Do not rent out names.
Do not make name resolution dependent on scarce resources.
Do not tie name resolution to real-world time.

After the DNS has resolved a URL into an IP address, the second phase of name resolution on the web takes place: the user agent contacts the server at the given IP address, and the server returns the data it stores at that point in time. This is the primary mechanism for mutability on the web: the same server can store and return different data at different points in time.

This location-based mutability is analogous to imperative programming: a variable name compiles down to a physical memory address on a computer. Whenever a program access the value of a variable, the computer supplies the current value stored at the corresponding address.

In a distributed system, location-based name resolution inherently leads to link rot.Like the web, the Fediverse runs on location-based mutability (as well as DNS). I will be sad to watch large parts of it be swallowed by link rot over the next decades. The server running at the location needs constant attention (or, at the very least, power), or it will be unable to supply data. Attackers or accidents might shut down the server. And censorship could prevent communication with the server.

Signed Bindings

A different approach to name-based mutability is what I will call55I personally have first encountered this technique in Secure Scuttlebutt and Earthstar; the IPNS cites the Self-certifying File System as inspiration. It seems safe to say that this technique has been and will be reinvented several times, but I am not aware of a common name for it. So I am giving it one simply to make writing this post more convenient. signed bindings. A name is simply the public key of a digital signature scheme. An author binds data to the name by signing the concatenation of the data66A straightforward optimisation is to sign a secure hash of the data instead. and a timestamp. When the same author binds different data to the same identifier, the binding with the greater timestamp wins. Services that provide name resolution cannot present fraudulent data, because they cannot forge the signature that the requestor will verify. The worst they can do is to deliberately present outdated data — but no system can force other actors to always divulge (new) data anyway.

I use timestamp in a wide sense here, anything that allows for tie-breaking between competing bindings works. A non-exhaustive selection of choices:

A wall-clock timestamp (Earthstar).
A logical counter (IPNS).The IPNS gives bindings a time-to-live, and the default resolution substrate limits it to 48 hours. This engineering concession sadly results in aggressive link rot.
A secure hash of the previous binding (Secure Scuttlebutt).
A secure hash of two previous bindings (Reed).
A secure hash of a block in a blockchain.
A date on which you placed a secure hash of the data as an advertisement in the New York Times.

The choice of timestamps can have far-reaching consequences. Of particular interest are single-writer versus multi-writer properties. With wall-clock timestamps,We have a discussion of the consequences of using claimed wallclock timestamps on the Willow page. multiple devices can publish data without needing significant coordination (beyond keeping their clocks in rough sync). With logical counters, when two devices publish new data logically concurrently, tiebreaking is needed, and that tiebreaking probably does not always align with user intuition about the passage of physical time. Timestamps based on hash chains enforces mutually exclusive write access to publish updates — concurrent writes can be detected, and typically invalidate the name.

Beyond implying single-writer77Running a distributed mutual-exclusion protocol to manage multiple writers on different physical devices essentially converts them into a single logical writer. systems, another important facet of hash chains is that they only allow for non-destructive mutation. Given that inclusion of a secure hash is semantically equivalent to including the hashed data verbatim, such name bindings only ever append new data, while keeping around all old data.

Other properties of signed bindings do not depend on the specific choice of timestamps. A fundamental one is that names do not include any information on how to resolve them — in contrast to an address, which essentially is the relevant information on how to resolve it. Systems built on signed bindings need to provide infrastructure for name resolution,I drafted quite a few paragraphs of detailed discussion, before deciding this should be left out of scope of this post. such as distributed hash tables (DHTs) or gossip protocols.

Resilient name resolution of signed bindings always builds on the fact that data need not be served by the original author: everyone can verify the signature for the bound data, so there is no need to trust the peer who gives you the data. Introducing intermediaries does open up questions of freshness. Did you get up-to-date data bound to the name, or is the data stale? Typically88A prominent exception is Nostr, which does not even guarantee eventual consistency., systems aim for eventual consistency, where all peers will eventually obtain the most up-to-date data once updates stop and the system keeps running long enough.

While resolving an address and contacting a server directly may feel like a stronger guarantee of recency, you can never know whether the server did not happen to update its data immediately after serving your request.The naming model I discuss in this post is purely pull-based. Efficient systems should probably incorporate push-based solutions (subscriptions to binding updates) as well. So you cannot ever be certain to be fully up-to-date in address-based settings either. And with enough engineering effort, you can reduce the delay in name resolution in signed-binding systems quite far as well. In particular, you could annotate names with the (IP) address of a recommended data source to contact for resolution.More broadly speaking, there is always an engineering space where you can overlay centralised architectures over a decentralised system to achieve certain optimisations. The important part is to make absolutely sure that things will continue to work when the centralised components fail. Such a system would have all the advantages of regular address-based systems while the recommended server is online, and would fail gracefully to a slower but more resilient resolution mechanism to prevent link rot.

When everyone can serve data to everyone else, the responsibility for keeping data around is diluted. In this sense, no system can fully prevent link rot: when the last remaining copy of some data is lost, it cannot be retrieved any longer. But, conceptually, each binding itself lasts as long as its digital signature scheme remains secure.

For every proposal to add content-addressing to the Fediverse to move beyond reliance on individual instances (i.e., addressed locations), there could be an equivalent proposal championing signed bindings.Perhaps curiously, signed bindings share much the same challenges as systems based on content-addressing. Both deal in names that are not human-readable, and both deal in names that carry no information in how to resolve them. I find it fair to say that if content-addressed systems can be made to work at scale99Did you know that “a dominant global system used by pretty much everyone, their dog, and their dog’s dozens of IoT devices” is not the only scale worth running systems at?, so can systems based on signed bindings.

There is a lot more design space for naming systems which enable mutation. Some examples:

Who should get to resolve names? Can everyone resolve everything, or is it possible to introduce some level of access control?
Should names carry additional structure? Willow,Willow can also mediate access control via this structure. for example, arranges names in a three-dimensional space, and then allows for spatial queries to resolve names in bulk.
Are there decentralised, rot-resistant techniques beyond signed bindings?

For now, I hope to have provided a useful overview of the basics.

The Actually Important Questions

I am channeling Joseph Weizenbaum in his radiant Computer Power and Human Reason here.I am also channeling Cade Diehm, whose writing has influenced how I approach systems design and who gave me immensely helpful feedback on an early draft of this post.The previous sections were purely technical, they discussed what can be done. Much more important, though, is what should be done. I personally believe strongly that we need resilient systems that do not suffer from link rot yet allow for active mutation. And I am unhappy how much of the time, energy, and resources that go into research and development of resilient systems go into systems with no or purely non-destructive notions of mutability.

Immutable systems based on content-addressable storage are elegant solutions to important concerns: censorship-resistantIn my following arguments in favour of mutable names, it is crucial that mutable approaches need not suffer from passive link rot either — otherwise, their benefits would be counteracted by lack of durability and resilience, and we would end up comparing apples and oranges. access to information, automatic durable archiving of valuable material, compression of data (deduplication is a special form of compression).

As such, I do hope that content-addressing will be an important tool in the distributed information systems of the future. But I believe that content-addressing should not be made the defining feature of such systems. I want to argue that a society would be impoverishedI fully subscribe to the assumption that the medium is the message. by making immutable links the default way of connecting its artifacts.

Publishing always sacrifices some agency over the data: whoever consumes the data can take a screenshot and put it on their fridge, and no computer protocol will allow the author to remove that printout. But a mutable link lets the author retain control over the outcome of the intended way of accessing the data.When I publish a mutable link to some data, I give agency to the author of that data. They are free to change the contents in the future, or even to fully remove the data. Like a verbatim copy, an immutable link would not award this agency.

By granting this agency, I voluntarily cede control. The quality of my work will vary based on the future actions of another human. A web of mutable references is a web of human interdepence, and builds on the voluntary giving of trust.

Freely given trust and consensual interdependence are deeply human, they sit at the root of the most positive relationships we can form. I want to immerse myself (and have others immerse themselves) in a medium that fosters these notions, and I want to stay clear from media whose design choices eliminate them.

A medium of mutable links should further teach an important lesson: trust must never be without alternative. I can choose to copy1010Or, ideally, I would be able to refer to the content via content-addressing, which has much better usability. content and retain full control over it. If trust is not an active choice but the only option, it becomes meaningless.

I imagine that the vocal proponents of eliminating as much trust from distributed systems as possible arrive at this decision because they primarily reflect on involuntarily given trust1111A prime example would be the forced trust in and usage of the currency of the nation state you happen live in — it certainly is no coincidence that cryptocurrency, blockchain, and code-is-law enthusiasts frequently employ narratives of trustlessness.. Such indiscriminate opposition to all (giving of) trust makes me quite sad. Instead, I want to see systems that minimise involuntary, coerced trust, while simultaneously maximising opportunities for voluntarily granting trust.

And that is why I, like anybody would, decided to write a post on mutable name bindings.

The analysis of the implications of mutability without link rot is multifaceted. I personally am passionate about the beauty1212Let alone the engineering opportunities you get from trusting other agents in a distributed system. But that is a different topic for a different piece of writing. of meaningfully given trust, so that is what I want to contribute here the most. But there are numerous other important issues as well, and things are rarely black and white. The inability to retract one’s material can cause harm. But deletion of publicly valuable information can also cause harm. Sometimes, active moderation of content by parties other than the authors could prevent harm. Immutability can support accountability, can counteract gaslighting or the rewriting of alternate facts into the past. How do we deal with lies and manipulation? What are the roles of apologies and foregiveness?

I do not have all the answers, and neither does anyone else. But we should normalise discussions on these issues. Distributed systems designs are not neutral1313And I wish the academic computer science community would stop pretending otherwise., neither politically nor socially. When I see a new project pop up that glosses over all of these issues and does not position itself, I am not interested. I care less for what you can do, and more for what you consider worth doing.

I believe that systems which neither suffer from link rot nor force permanence on all participants, which stay resilient yet allow for mutation, should receive more attention. How widely can they be deployed? How accessible can they be, and to whom? How can we categorise different offshoots, how can we explore, chart, and analyse the design space? Which patterns and actors will emerge, who can, should, and will shape these systems? How can we ensure that as many people as possible who wish to engage with them can benefit?

I believe more people should work toward finding the answers.

Home