Magma is a protocol for immutably storing snapshots of values which evolve over time, in a way that enables efficient updates from one value to a later one even if the source providing the update data cannot compute what changed between any two values. That is quite a mouthful, so we begin with an extended description of the problem that magma aims to solve.
Consider a data source producing a sequence of values. These values might be large, but any new value can be obtained from the preceding value through a small change — we can think of the sequence as describing a single, mutable variable changing over time.
The sequence is read by a server, which stores enough data to be able to reconstruct the value of the variable at any point of its evolution. The easiest way of accomplishing this is to store the full sequence of values, but other options exist.
Now consider a client that knows the value of the variable at a certain point of time, has obtained the identifier of a more recent value from a trusted source, and now wants to obtain that value from the server. There are some trivial strategies, but all of them are problematic:
Magma provides a more efficient trade-off between these extremes, by employing binary anti-monotone linking schemes. It further uses secure hashes to ensure that the server party need not be trusted by either the data source or the client.
We begin by examining how a data source can publish values such that clients can efficiently obtain updates from a server without requiring the server to be able to perform computations on the values. The magma protocol itself is introduced later.
The approach that magma uses applies to a wider class of data structures than the sequences from the introduction. We consider values with at most one preceding value, or in other words, forests of rooted trees labeled with changes. Or yet another viewpoint: we consider sets of sequences that may share a common past (i.e., a common prefix).
We call the pair $\sequence{G, \lbl}$ a labeled in-forest if $G$ is a forest of rooted trees with all edges oriented toward the roots.
The following figure shows an example labeled in-forest.
We now wish to find a data structure that allows to efficiently reconstruct a labeled in-forest by applying a series of changes. To obtain an efficient solution, we require that the grouping of how changes are applied is irrelevant (notice that the order of changes must however remain unchanged), this is captured in the concept of a semigroup.
A semigroup is a pair $\sequence{S, \groupAdd}$ of a set $S$ and a function $\fun{\groupAdd}{S \times S}{S}$ satisfying:
The core idea now is to create a graph describing the changes between different values in a labeled in-forest. Binary anti-monotone linking schemes have short paths between any pair of vertices, labeling the arcs of a binary anti-monotone linking scheme allows to efficiently reconstruct the accumulated changes. Associativity ensures that accumulating along skip links leads to the same results as accumulating along predecessor links. Formally:
$\DeclareMathOperator{\depthVanilla}{depth}\newcommand{\depth}[2]{\href{##forest_notation}{\depthVanilla}_{#2}({#1})}$The depth of $v$, written $\depth{v}{G}$, is the length of the path from $v$ to its root ($0$ in case $v$ is a root itself).
$\newcommand{\ancestor}[1]{\href{##forest_notation}{\leq}_{#1}}\newcommand{\nancestor}[1]{\href{##forest_notation}{\nleq}_{#1}}$We write $u \ancestor{G} v$ if there is a path from $v$ to $u$.
$\newcommand{\evolution}[3]{{#1}_{\sequence{{#2}, {#3}}}}$An $\sequence{f, \mathcal{S}}$-evolution of $\mathcal{F}$ is an arc-labeled directed graph $\evolution{E}{f}{\mathcal{S}} \defeq \sequence{\evolution{G}{f}{\mathcal{S}} = (\evolution{V}{f}{\mathcal{S}}, \evolution{A}{f}{\mathcal{S}}), \evolution{\lbl}{f}{\mathcal{S}}}$, such that:
$\begin{align*} \evolution{V}{f}{\mathcal{S}} &\defeq V \discriminatedUnion \set{\nil},\\ \evolution{A}{f}{\mathcal{S}} &\defeq A \cup \set{\sequence{\nil, v} \mid \text{$v \in V$ and $\depth{v}{G} = 0$}}\\ &\cup \{\sequence{u, v} \mid u, v \in V, u \ancestor{G} v,\\ &\text{\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;and $\depth{u}{G} + 1 = f(\depth{v}{G} + 1)$}\},\\ \end{align*}$
and for all $\sequence{v, u} \in \evolution{A}{f}{\mathcal{S}}$, $\lbl(u) \groupAdd \evolution{\lbl}{f}{\mathcal{S}}(\sequence{v, u}) = \lbl(v)$.
$\DeclareMathOperator{\lsTwoVanilla}{ls_{2}}\newcommand{\lstwo}[0]{\href{https://aljoscha-meyer.de/linkingschemes##lstwo}{\lsTwoVanilla}}$The following figure shows the unique $\sequence{\lstwo, \mathcal{Z}}$-evolution of Example 0.2, where $\mathcal{Z} \defeq \sequence{\Z, +}$ is the additive semigroup of integers. Notice how adding up the arc labels along any path from a vertex to the root yields the label of that vertex in Example 0.2.
If a server stores an evolution and a client wants to update a value, the server can answer with the arc labels along the shortest path between the requested value and the clients old value.
Because we consider settings where server does not perform any computation on the data it serves, the data source has to publish data in the correct form. More specifically, an update to the labeled in-forest produced by a data source must include the following information: the new vertex $v$, the predecessor vertex $p$, the change from the predecessor $\evolution{\lbl}{f}{\mathcal{S}}(\sequence{v, p})$, the skip link vertex $s$, and the change from that vertex $\evolution{\lbl}{f}{\mathcal{S}}(\sequence{v, s})$.
This concludes an abstract description of how magma solves the problem of providing incremental updates through untrusted intermediaries. The next section introduces the magma protocol proper, going into specifics of, e.g., the question of how peers can identify vertices, or how clients can be protected from servers feeding them garbage data.
$\newcommand{\h}[0]{\href{##h}{h}}\newcommand{\digest}[0]{\href{##digest}{\type{Digest}}}$Working with evolutions requires the ability to compactly reference pieces of data. Magma uses content-based addressing, the reference to a datum consists of the digest of a cryptographic hash function of that datum. Magma is generic over which hash function is used, an instantiation of the protocol must specify one. In the following text, we write $\cssId{h}{\fun{\h}{\bitstring}{\digest}}$ for that function, with $\cssId{digest}{\digest}$ being the set of bit-strings of some fixed length.
$\newcommand{\SM}[0]{\href{##Sm}{\mathcal{S}}}\newcommand{\S}[0]{\href{##S}{S}}\newcommand{\sAdd}[0]{\href{##s_add}{\groupAdd}}\newcommand{\encS}[0]{\href{##encS}{\enc_{\S}}}\newcommand{\EncS}[0]{\href{##EncS}{\type{Enc}_{\S}}}$An instantiation of magma must also specify the semigroup of the values, we write $\cssId{SM}{\SM} = \sequence{\cssId{S}{\S}, \cssId{s_add}{\sAdd}}$ for it. Furthermore, it must be possible to encode values as bit-strings, an instantiation must specify an injective encoding function $\fun{\cssId{encS}{\encS}}{\S}{\EncS}$ with $\cssId{EncS}{\EncS} \subseteq \bitstring$.
$\DeclareMathOperator{\lsThreeVanilla}{ls_{3}}\newcommand{\lsthree}[0]{\href{https://aljoscha-meyer.de/linkingschemes##lsthree}{\lsThreeVanilla}}$Magma uses $\lsthree$ as its linking scheme.
With these parameters fixed, we can now define magma events, the data published by a data source:
A magma event is an instance of the following datatype:
$\newcommand{\magmaEvent}[0]{\href{##magma_event_type}{\type{MagmaEvent}}}\newcommand{\rootEvent}[0]{\href{##root_event}{\constructor{RootEvent}}}\newcommand{\predChangeRoot}[0]{\href{##pred_change_root}{\field{\delta_S}}}\newcommand{\predLengthRoot}[0]{\href{##pred_length_root}{\field{\delta_{\mathit{len}}}}}\newcommand{\childEvent}[0]{\href{##child_event}{\constructor{ChildEvent}}}\newcommand{\depthEvent}[0]{\href{##event_depth}{\field{depth}}}\newcommand{\pred}[0]{\href{##pred}{\field{\delta}}}\newcommand{\predChange}[0]{\href{##pred_change}{\field{\delta_S}}}\newcommand{\predLength}[0]{\href{##pred_length}{\field{\delta_{\mathit{len}}}}}\newcommand{\skip}[0]{\href{##skip}{\field{\Delta}}}\newcommand{\skipChange}[0]{\href{##skip_change}{\field{\Delta_S}}}\newcommand{\skipLength}[0]{\href{##skip_length}{\field{\Delta_{\mathit{len}}}}}$The 64-bit integers used for $\depthEvent$, $\predLength$, and $\skipLength$ limit which evolutions can be represented: neither the depth of any vertex nor the size of the encoding of a semigroup value may exceed $2^{64} - 1$.
Because magma events reference each other via hashes, we have to specify a way of encoding them:
$\newcommand{\eventEncoding}[1]{\href{##magma_event_encoding}{\enc_{\value{event}}}({#1})}$The event encoding $\eventEncoding{e}$ of a magma event $e$ is a bit-string defined as follows:
We can now formalize the intuitive comments of Definition 0.7 by defining a mapping from the vertices of an evolution to magma events.
$\DeclareMathOperator{\eventVanilla}{event}\newcommand{\event}[1]{\href{##event_meaning}{\eventVanilla}_{#1}}$The magma event $\event{v}$ is defined as follows:
If $\depth{\evolution{G}{f}{\mathcal{S}}}(v) = 1$, then:
$\event{v} \defeq$ $\rootEvent$ {Otherwise, let $p$ be the predecessor of $v$, and let $s$ be the other out-neighbor of $v$ if one exists, or the predecessor of $v$ if $v$ has only one out-neighbor. Then:
$\event{v} \defeq$ $\childEvent$ {Data sources publish new additions to an evolution in the form of magma events, and servers relay those magma events to clients. Magma events contain more than the bare minimum of data that would be necessary to reconstruct an evolution. This additional data allows clients to detect malicious behavior in a server. The structure of their communication is described in the next section.
The general structure of the communication between client and server is that of the client making requests and the server responding with magma events and semigroup values.
A request contains the hash of the event encoding for which the client wishes to obtain the accumulated value, as well as the hash of the most recent magma event the client has.
The server responds with the magma events along the shortest path from the requested event to the old event, supplying the client with enough metadata to verify the integrity of the semigroup values the server transmits afterwards. The request furthermore contains some parameters determining which semigroup values are to be transmitted, as well as some options for skipping parts of the event transfer in case some of that information is already available to the client.
We give a precise definition of the options a client has to specify which data it is interested in:
$\newcommand{\Mode}[0]{\href{##Mode}{\type{Mode}}} \newcommand{\payloadNone}[0]{\href{##payload_none}{\value{None}}} \newcommand{\payloadAscending}[0]{\href{##payload_ascending}{\value{Asc}}} \newcommand{\payloadDescending}[0]{\href{##payload_descending}{\value{Desc}}} \newcommand{\payloadAllAscending}[0]{\href{##payload_all_ascending}{\value{AllAsc}}} \newcommand{\payloadAllDescending}[0]{\href{##payload_all_descending}{\value{AllDesc}}}$A request is an instance of the following datatype:
$\newcommand{\request}[0]{\href{##request_type}{\type{Request}}}\newcommand{\old}[0]{\href{##old}{\field{old}}}\newcommand{\new}[0]{\href{##new}{\field{new}}}\newcommand{\mode}[0]{\href{##mode}{\field{mode}}}\newcommand{\offsetEvent}[0]{\href{##event_offset}{\field{offset_{\mathit{event}}}}}\newcommand{\offsetValue}[0]{\href{##value_offset}{\field{offset_{\mathit{value}}}}}$Now we can specify the data transmitted by the server.
A response is an instance of the following datatype:
$\newcommand{\response}[0]{\href{##response_type}{\type{Response}}}\newcommand{\unknownEvent}[0]{\href{##unknown_event}{\constructor{UnknownEvent}}}\newcommand{\data}[0]{\href{##data}{\constructor{Data}}}\newcommand{\events}[0]{\href{##events}{\field{events}}}\newcommand{\additionalEvents}[0]{\href{##additional_events}{\field{additional\_events}}}\newcommand{\completePayload}[0]{\href{##complete_payload}{\field{complete\_payload}}}\newcommand{\values}[0]{\href{##values}{\field{values}}}$Regardless of the specifics of the communication protocol, a server sending a $\data$ response first transmits the $\events$ in order of descending depth. The client hashes the first received magma event and verifies that the resulting digest matches the one it requested. For all further magma events, the client verifies that the depth has the correct value (the correct position in the shortest path in the evolution).
Next, the server transmits the $\additionalEvents$, again in order of descending depth. The client again verifies that the depth of each magma event is correct.
Furthermore, whenever the client receives a magma event, it verifies that all incoming and outgoing links are consistent. The client computes the hash of the received event and verifies that it matches with the $\pred$ or $skip$ value of all known (to the client) magma events that correspond to in-neighbors in the evolution graph. The client does the same for the out-neighbors as well.
The server next communicates the value of $\completePayload$.
Then, the semigroup $\values$ of the response are transmitted sorted by the order of the depths of the associated magma events. Whether they are sorted in ascending or descending order is specified by the $\mode$ of the request. When the client receives a semigroup value, it verifies that its hash and length exactly match the ones given in the corresponding magma event.
We now give a formal definition of how a server maps requests to responses. To do so, we first need to introduce some notation to account for the fact that a server usually does not know all the magma events for a complete evolution.
A $\evolution{E}{\lsthree}{\SM}$-magma chamber is a pair $\sequence{E, U}$, where $E \subseteq \set{\event{v} \mid v \in \evolution{V}{\lsthree}{\SM}}$ and $U \subseteq \set{\evolution{\lbl}{\lsthree}{\SM}(\sequence{u, v}) \mid \sequence{u, v} \in \evolution{A}{\lsthree}{\SM}}$.
We can now define how to map a magma chamber and a request to a response.
If there is no $e \in E$ with $\h(\eventEncoding{e}) = \access{r}{\new}$, the response is $\unknownEvent$.
If $\access{r}{\old} \neq \nil$ and there is no $e \in E$ with $\h(\eventEncoding{e}) = \access{r}{\old}$, the response is $\unknownEvent$.
Otherwise, let $n, o \in \evolution{V}{\lsthree}{\SM}$ be the vertices with $\h(\eventEncoding{n}) = \access{r}{\new}$, and $\h(\eventEncoding{o}) = \access{r}{\old}$ if $\access{r}{\old} \neq \nil$, or $o \defeq \nil$ if $\access{r}{\old} = \nil$.
If $\depth{o}{\evolution{G}{\lsthree}{\SM}} \geq \depth{n}{\evolution{G}{\lsthree}{\SM}}$, then the response is $\data$ {
If $E$ contains the magma events for the vertices along a path from $n$ to some $e \in \evolution{V}{\lsthree}{\SM}$ with $e \neq o, \depth{e}{\evolution{G}{\lsthree}{\SM}} = \depth{o}{\evolution{G}{\lsthree}{\SM}}$, then the server proves to the client that $o \nancestor{\evolution{G}{\lsthree}{\SM}} n$ by responding as if it had received the request $\request$ {
Otherwise, there are two remaining cases: either $E$ contains the magma events for the vertices along the shortest path from $n$ to $o$, or it does not. If it does not, let $m \in \evolution{V}{\lsthree}{\SM}$ be the vertex of least depth such that $o \ancestor{\evolution{G}{\lsthree}{\SM}} m \ancestor{\evolution{G}{\lsthree}{\SM}} n$ and $E$ contains the magma events for the vertices along a path from $n$ to $m$. Let $P$ be that path. Then the response is $\data$ {
In the last remaining case, $E$ contains the magma events for the vertices along the shortest path from $n$ to $o$. Let $P$ be that path. The response then depends on $\access{r}{\mode}$:
The response is $\data$ {
The response is $\data$ {
The response is $\data$ {
A request has to specify the hash of the magma event to which it wants to update to. Lava extends magma with the powerful option of requesting new events without knowing their hash in advance. The basic idea is that specifying the hash of a well-known event $e$ of depth $d$ together with an depth offset $o$ is unambiguous if there is at most one event $v$ of depth $d + o$ with $e \ancestor{} v$. In case of ambiguity, the server can transmit data up until the first point of ambiguity.
If servers indiscriminately store all new events, regardless of the data source, then it becomes unlikely that any offset-based request is unambiguous. Lava introduces the concept of authorship so that a data source can ensure that offset-based requests can always be served, by producing at most one successor event per event.
$\newcommand{\sign}[0]{\href{##sign}{\mathit{sign}}}\newcommand{\Pk}[0]{\href{##pk}{\type{PK}}}\newcommand{\Sk}[0]{\href{##sk}{\type{SK}}}\newcommand{\verify}[0]{\href{##verify}{\mathit{verify}}}\newcommand{\signature}[0]{\href{##signature}{\type{Sig}}}$Authorship is implemented by signing all events with a digital signature. An instantiation of the lava protocol must fix a signature scheme, we write $\cssId{signature}{\signature}$ for the type of signatures (bit-strings of some fixed length), $\cssId{pk}{\Pk}$ for the type of public keys (bit-strings of some fixed length), $\cssId{sk}{\Sk}$ for the type of secret keys (bit-strings of some fixed length), $\cssId{sign}{\sign}$ for the signing function and $\cssId{verify}{\verify}$ for the verification function.
Offset-based requests are not the only way for requesting unknown data. An alternative is to specify a maximum payload size and then receive data until the total size of the encodings of semigroup values reaches that number.
We can generalize this by regarding the encoding size as a particular type of metadata. A different example for metadata could be the time that has passed since creating the previous entry. This would allow queries based on time windows.
Other types of metadata can be defined as well, the important part is that the value is strictly increasing, so that a server has always at most one point at which it should end the response. It must also be possible to aggregate the metadata within the events, so there must be an associative function for combining metadata entries.
We call $R$ a total order if it satisfies the following properties:
An ordered semigroup is a semigroup $\sequence{S, \groupAdd}$ and a total order $R \subseteq S \times S$ such that for all $x, y \in S$ it holds that $\sequence{x, x \groupAdd y} \in R$.
$\newcommand{\k}[0]{\href{##k}{k}}\newcommand{\encM}[1]{\href{##encM}{\enc_{M_{#1}}}}$Lava cannot anticipate all possible needs for metadata, so it is generic over the metadata it supports. An instantiation of the protocol must fix $\k$ ordered semigroups $\mathcal{M}_i = \sequence{M_i, \groupAdd_i}, 1 \leq i \leq \k$ that describe the types of metadata associated with each event. Furthermore, it must be possible to encode metadata values as bit-strings, an instantiation must specify an injective encoding function $\fun{\cssId{encM}{\encM{i}}}{M_{i}}{\bitstring}$ for $1 \leq i \leq \k$.
With those concepts introduced, we can give the formal definitions that underlie lava. Everything builds on labeled in-forests again, but every vertex now also holds $\k$ pieces of metadata and is associated to an author.
$\newcommand{\metadata}[0]{\href{##metadata}{\mathit{metadata}}}\newcommand{\author}[0]{\href{##author}{\mathit{author}}}$A labeled in-forest with metadata and authorship is a tuple $\sequence{G, \lbl, \cssId{author}{\author}, \cssId{metadata}{\metadata_1}, \ldots, \metadata_{\k}}$, where $\sequence{G = (V, A), \lbl}$ is a labeled in-forest, $\fun{\author}{V}{\Pk}$ maps vertices to their author, and $\fun{\metadata_i}{V}{M_i}, 1 \leq i \leq \k$ are functions associating metadata with each vertex.
Next, we extend the concept of an evolution with authorship and additional arc labels for each piece of metadata.
$\newcommand{\evolutionLava}[4]{{#1}_{\sequence{{#2}, {#3}, {#4}}}}$An $\sequence{f, \mathcal{S}, \mathcal{M}}$-evolution with metadata and authorship of $\mathcal{F}$ is a labeled directed graph with multiple labeling functions $\evolutionLava{E}{f}{\mathcal{S}}{\mathcal{M}} \defeq \sequence{\evolutionLava{G}{f}{\mathcal{S}}{\mathcal{M}} = (\evolutionLava{V}{f}{\mathcal{S}}{\mathcal{M}}, \evolutionLava{A}{f}{\mathcal{S}}{\mathcal{M}}), \fun{\author}{V}{\Pk}, \evolutionLava{\lbl}{f}{\mathcal{S}}{\mathcal{M}}, \evolutionLava{\metadata_0}{f}{\mathcal{S}}{\mathcal{M}}, \ldots, \evolutionLava{\metadata_0}{f}{\mathcal{S}}{\mathcal{M}}}$, such that:
$\sequence{\evolutionLava{G}{f}{\mathcal{S}}{\mathcal{M}}, \evolutionLava{\lbl}{f}{\mathcal{S}}{\mathcal{M}}}$ is a $\sequence{f, \mathcal{S}}$-evolution
and $\fun{{\evolutionLava{\metadata_i}{f}{\mathcal{S}}{\mathcal{M}}}_i}{\evolutionLava{A}{f}{\mathcal{S}}{\mathcal{M}}}{M_i}, 1 \leq i \leq \k$ are functions such that for all $\sequence{v, u} \in \evolution{A}{f}{\mathcal{S}}$, $\metadata_i(u) \groupAdd_i \evolutionLava{\metadata_i}{f}{\mathcal{S}}{\mathcal{M}}(\sequence{v, u}) = \metadata_i(v)$.
Now, we can specify the pieces of data by which data sources produce such evolutions with metadata and authorship. This is where signatures enter the picture.
A lava event is an instance of the following datatype (with fields for all $1 \leq i \leq \k$):
$\newcommand{\eventLava}[0]{\href{##event_lava}{\type{LavaEvent}}}\newcommand{\magma}[0]{\href{##magma}{\field{magma}}}\newcommand{\sig}[0]{\href{##sig}{\field{sig}}}\newcommand{\predMeta}[0]{\href{##pred_meta}{\field{meta_{\delta, i}}}}\newcommand{\skipMeta}[0]{\href{##skip_meta}{\field{meta_{\Delta, i}}}}$$\newcommand{\eventEncodingLava}[1]{\href{##lava_event_encoding}{\enc_{\value{event}}}({#1})}$The event encoding $\eventEncodingLava{e}$ of a lava event $e$ is the bit-string $\access{e}{\sig}$, followed by $\eventEncoding{\access{e}{\magma}}$, followed by ($\encM{i}(\access{e}{\predMeta})$ followed by $\encM{i}(\access{e}{\skipMeta})$) for all $1 \leq i \leq \k$.
We can now formalize the intuitive comments of Definition 0.18 by defining a mapping from the vertices of an evolution with metadata and authorship to lava events.
TODO
$\newcommand{\ModeLava}[0]{\href{##ModeLava}{\type{Mode}}} \newcommand{\modeLavaLazy}[0]{\href{##mode_lava_lazy}{\value{Lazy}}} \newcommand{\modeLavaEvents}[0]{\href{##mode_lava_events}{\value{Events}}} \newcommand{\modeLavaBoth}[0]{\href{##mode_lava_both}{\value{Both}}}$A lava request is an instance of the following datatype:
$\newcommand{\requestLava}[0]{\href{##request_type_lava}{\type{LavaRequest}}}\newcommand{\magmaRequest}[0]{\href{##magma_request}{\constructor{MagmaRequest}}}\newcommand{\bla}[0]{\href{##bla}{\field{bla}}}\newcommand{\relative}[0]{\href{##relative}{\constructor{Relative}}}\newcommand{\oldLava}[0]{\href{##old_lava}{\field{old}}}\newcommand{\changeDepth}[0]{\href{##change_depth}{\field{change_{\depthEvent}}}}\newcommand{\changeLength}[0]{\href{##change_length}{\field{change_{\mathit{len}}}}}\newcommand{\changeMeta}[0]{\href{##change_meta}{\field{change_{meta_i}}}}\newcommand{\key}[0]{\href{##key}{\field{key}}}\newcommand{\increasing}[0]{\href{##increasing}{\field{increasing}}}\newcommand{\all}[0]{\href{##all}{\field{all}}}\newcommand{\modeLava}[0]{\href{##modeLava}{\field{mode}}}$A response to a $\relative$ consists of a sequence alternating between events and semigroup values (if the $\modeLava$ is $\modeLavaBoth$), up until (and excluding) the first event for which one of the accumulated changes exceeds one of the maxima given in the request. The sequence also ends if the server has multiple descendent events with the same depth. In that case, the sequence ends just before those events, and is then followed by a list of all those events. If the $\modeLava$ is $\modeLavaEvents$, the sequence consists just of the events, the semigroup values are omitted. If the $\modeLava$ is $\modeLavaLazy$, rather than transmitting the sequence, the server transmits a sequence of 64-bit integers that sum up to the total change in depth over the sequence (this is transmitted as a sequence of integers rather than a single integer so that the server can notify the client about new events becoming available to the server).
TODO