

# Semantics of Remote Direct Memory Access: Operational and Declarative Models of RDMA on TSO Architectures

[GUILLAUME AMBAL](HTTPS://ORCID.ORG/0000-0002-4667-7266)<sup>∗</sup> , Imperial College London, UK [BRIJESH DONGOL,](HTTPS://ORCID.ORG/0000-0003-0446-3507) University of Surrey , UK [HAGGAI ERAN,](HTTPS://ORCID.ORG/0000-0002-2159-9046) NVIDIA, Israel [VASILEIOS KLIMIS,](HTTPS://ORCID.ORG/0000-0002-3173-8636) Queen Mary University of London, UK [ORI LAHAV,](HTTPS://ORCID.ORG/0000-0003-4305-6998) Tel Aviv University, Israel [AZALEA RAAD,](HTTPS://ORCID.ORG/0000-0002-2319-3242) Imperial College London, UK

Remote direct memory access (RDMA) is a modern technology enabling networked machines to exchange information without involving the operating system of either side, and thus significantly speeding up data transfer in computer clusters. While RDMA is extensively used in practice and studied in various research papers, a formal underlying model specifying the allowed behaviours of concurrent RDMA programs running in modern multicore architectures is still missing. This paper aims to close this gap and provide semantic foundations of RDMA on x86-TSO machines. We propose three equivalent formal models, two operational models in different levels of abstraction and one declarative model, and prove that the three characterisations are equivalent. To gain confidence in the proposed semantics, the more concrete operational model has been reviewed by NVIDIA experts, a major vendor of RDMA systems, and we have empirically validated the declarative formalisation on various subtle litmus tests by extensive testing. We believe that this work is a necessary initial step for formally addressing RDMA-based systems by proposing language-level models, verifying their mapping to hardware, and developing reasoning techniques for concurrent RDMA programs.

CCS Concepts: • Software and its engineering  $\rightarrow$  Formal language definitions; • Theory of computation  $\rightarrow$  Program semantics; Distributed computing models; • Hardware  $\rightarrow$  Testing with distributed and parallel systems.

Additional Key Words and Phrases: RDMA, Operational Semantics, Declarative Semantics, x86-TSO

### ACM Reference Format:

Guillaume Ambal, Brijesh Dongol, Haggai Eran, Vasileios Klimis, Ori Lahav, and Azalea Raad. 2024. Semantics of Remote Direct Memory Access: Operational and Declarative Models of RDMA on TSO Architectures. Proc. ACM Program. Lang. 8, OOPSLA2, Article 341 (October 2024), [28](#page-27-0) pages. <https://doi.org/10.1145/3689781>

# 1 Introduction

Remote direct memory access (RDMA) technologies such as InfiniBand and RDMA over Converged Ethernet (RoCE) enable a machine to have direct read/write access to the memory of another machine over a network, bypassing the operating systems of both machines. This way, remote reads and writes are performed with far fewer CPU cycles, leading to high-throughput, low-latency

<sup>∗</sup>Corresponding author

Authors' Contact Information: [Guillaume Ambal,](https://orcid.org/0000-0002-4667-7266) Imperial College London, UK, g.ambal@imperial.ac.uk; [Brijesh Dongol,](https://orcid.org/0000-0003-0446-3507) University of Surrey , UK, b.dongol@surrey.ac.uk; [Haggai Eran,](https://orcid.org/0000-0002-2159-9046) NVIDIA, Israel, haggaie@nvidia.com; [Vasileios Klimis,](https://orcid.org/0000-0002-3173-8636) Queen Mary University of London, UK, v.klimis@qmul.ac.uk; [Ori Lahav,](https://orcid.org/0000-0003-4305-6998) Tel Aviv University, Israel, orilahav@tau.ac.il; [Azalea Raad,](https://orcid.org/0000-0002-2319-3242) Imperial College London, UK, azalea.raad@imperial.ac.uk.



[This work is licensed under a Creative Commons Attribution 4.0 International License.](https://creativecommons.org/licenses/by/4.0/)

© 2024 Copyright held by the owner/author(s). ACM 2475-1421/2024/10-ART341 <https://doi.org/10.1145/3689781>

networking, which is especially useful in massively parallel computer clusters, e.g. for data centres, big data, and scientific computation. Thanks to implementations that offer higher performance at a comparable cost over traditional networking infrastructure (e.g. TCP/IP sockets) [\[Gerstenberger](#page-25-0) [et al.](#page-25-0) [2018\]](#page-25-0), RDMA has achieved widespread adoption as of 2018 [\[Shpiner et al.](#page-26-0) [2017\]](#page-26-0) and has been rapidly adopted in modern data centres.

At the lowest level, RDMA networks directly interact with the hardware through calls to read (get) and write (put) operations to remote memory locations. As a result, programming RDMA systems is conceptually similar to shared memory systems of existing hardware architectures such as Intel-x86 or ARM. A key difference, however, is that when a machine encounters a remote operation, the CPU forwards it onto the network interface card (NIC), which subsequently handles the remote operation and its associated memory accesses without further CPU involvement.

There is a wide range of RDMA implementations, starting from the Virtual Interface Architecture (VIA) [\[Dunning et al.](#page-25-1) [1998\]](#page-25-1), later adapted by InfiniBand [\[IBTA](#page-25-2) [2022\]](#page-25-2) and RoCE [\[InfiniBand](#page-25-3) [Trade Association \(IBTA\)](#page-25-3) [2018\]](#page-25-3). Other standards include iWarp [\[Recio et al.](#page-26-1) [2007\]](#page-26-1) and Omni-Path [\[Birrittella et al.](#page-24-0) [2015\]](#page-24-0). Some transfer technologies without NICs, such as FireWire [\[Anderson](#page-24-1) [1999\]](#page-24-1), can also be considered RDMA, but they do not scale to wide networks. In most implementations, an RDMA NIC implements the transport in hardware and is controlled by software through APIs such as Verbs [\[linux-rdma](#page-25-4) [2018\]](#page-25-4) or libfabric [\[OpenFabrics](#page-25-5) [2016\]](#page-25-5). A NIC typically connects to the host CPU through an internal server fabric such as PCIe, though in some cases the NIC and compute cores can be more tightly integrated, e.g. in academic proposals [\[Novakovic et al.](#page-25-6) [2014\]](#page-25-6) and in DPUs [\[NVIDIA Corporation](#page-25-7) [2021\]](#page-25-7). In addition, there have been proposals for next-gen fabrics to replace PCIe (e.g. CXL [\[Van Doren](#page-26-2) [2019\]](#page-26-2)).

Our focus here is on the IB Verbs model defined by [IBTA](#page-25-2) [\[2022\]](#page-25-2), using PCIe as the internal fabric. It was designed for InfiniBand and reused for RoCE, the two most popular RDMA technologies. While the specification for the iWarp protocol is slightly more permissive, its main implementation (libfabric) follows the stronger semantics presented here. Lastly, the more recent Omni-Path fabric also has legacy support for IB Verbs.

The performance gains of RDMA, as well as its wide range of implementations, have led to a surge of both theoretical and practical RDMA research [\[Aguilera et al.](#page-24-2) [2019;](#page-24-2) [Dan et al.](#page-25-8) [2016;](#page-25-8) [Wei](#page-26-3) [et al.](#page-26-3) [2015;](#page-26-3) [Zhu et al.](#page-27-1) [2015\]](#page-27-1). However, as we discuss below, programming RDMA systems correctly is not straightforward, and the RDMA community would benefit greatly from formal models and rigorous techniques for reasoning about RDMA programs.

A key challenge lies in understanding the different degrees of concurrency and their interaction thereof. More concretely, a program may comprise threads that run over multiple nodes (machines) over the network (inter-node concurrency), with each node itself executing several threads (intranode concurrency). As such, to understand the behaviour of a concurrent RDMA program, one must understand how remote and local operations on different nodes interact with one another. The problem is that local operations are handled by the CPU, while remote operations are handled by the NIC independently and in parallel to other CPU operations. Consequently, two sequential operations may not be executed in the intended (program) order, leading to surprising outcomes.

To understand the behaviour of RDMA programs, we must understand the order in which CPU and remote operations are executed, how they may be reordered, and how and when their effects are made visible to concurrent threads, be they on the same or different nodes. Specifically, much in the same way that the semantics of multi-processor hardware architectures such as Intel-x86, POWER, and ARM have been described via formal consistency models (a.k.a. memory models) [\[Alglave et al.](#page-24-3) [2021,](#page-24-3) [2014;](#page-24-4) [Cho et al.](#page-24-5) [2021;](#page-24-5) [Mador-Haim et al.](#page-25-9) [2012;](#page-25-9) [Pulte et al.](#page-26-4) [2018,](#page-26-4) [2019;](#page-26-5) [Raad et al.](#page-26-6) [2022;](#page-26-6) [Raad](#page-26-7) [and Vafeiadis](#page-26-7) [2018;](#page-26-7) [Raad et al.](#page-26-8) [2020b,](#page-26-8) [2019b;](#page-26-9) [Sarkar et al.](#page-26-10) [2011;](#page-26-10) [Sewell et al.](#page-26-11) [2010\]](#page-26-11), we should ideally describe the semantics of RDMA programs formally. Such formal models not only provide a rigorous

underpinning for reasoning about the behaviour of programs, they have also been historically successful at identifying mistakes and ambiguities in the existing hardware reference manuals, as well as compilation bugs [\[Alglave et al.](#page-24-3) [2021,](#page-24-3) [2014;](#page-24-4) [Lahav et al.](#page-25-10) [2017;](#page-25-10) [Pulte et al.](#page-26-4) [2018;](#page-26-4) [Raad et al.](#page-26-8) [2020b\]](#page-26-8).

Unfortunately, the existing literature includes next to no work on the formal semantics of RDMA programs. Indeed, to our knowledge, the coreRMA model by [Dan et al.](#page-25-8) [\[2016\]](#page-25-8) is the only one to offer a formal description of RDMA programs. However, this work has four key shortcomings.

First, coreRMA assumes that CPU concurrency on each node is governed by the sequential consistency (SC) [\[Lamport](#page-25-11) [1979\]](#page-25-11) model. This is an unrealistic assumption as no existing CPU architecture supports SC by default, and the two mainstream CPU architectures, Intel-x86 and ARM, are both subject to weaker models that exhibit behaviours not possible under SC. This is a significant gap since Intel-x86 and ARM architectures are ubiquitous.

Second, coreRMA describes the semantics *only declaratively* (i.e. via execution graphs that are subject to a set of axioms stipulating the absence of certain cycles in the graphs) and not operationally (via transitions that describe how the underlying hardware processes each operation and manipulates the memory). While declarative models are common in the literature of weak memory models and are more concise, operational models provide a more intuitive account of the hardware guarantees (as they prescribe a step-by-step mechanism for producing the behaviours of a program). Moreover, having two characterisations is useful not only for ensuring the accuracy of the formalism, but also because each formulation may be more useful for establishing different results. In particular, operational models are better suited for underpinning program logics and checking the reachability of an erroneous configuration and/or robustness for finite-state programs with loops (e.g. [\[Abdulla et al.](#page-24-6) [2021;](#page-24-6) [Bouajjani et al.](#page-24-7) [2013;](#page-24-7) [Lahav and Boker](#page-25-12) [2020\]](#page-25-12)).

Third, the coreRMA authors have failed to *validate* their model against existing implementations in that they could not observe  $any$  of the weak behaviours allowed by coreRMA on existing implementations. That is, they could not practically justify the weakness of coreRMA.

Fourth and most importantly, as we discuss in detail in [§6,](#page-22-0) coreRMA is not faithful to the RDMA specification [\[IBTA](#page-25-2) [2022\]](#page-25-2) and departs from it in three different ways. In particular, coreRMA is neither stronger nor weaker than the specification, meaning that it admits certain behaviours disallowed by the specification, while prohibiting others allowed by the specification.

To close these gaps, we present  $RDMA^{TSO}$ , the *first* formal semantics of RDMA programs in the context of the x86 architecture, which implements the TSO model [\[Sewell et al.](#page-26-11) [2010\]](#page-26-11). We describe RDMA<sup>TSO</sup> both *operationally* and *declaratively* and prove that the two are *equivalent*. Specifically, we first develop two operational models of RDMA<sup>TSO</sup>: (1) a concrete model, reflecting the hardware structure for propagating data across the network; and (2) a simplified model, abstracting away the hardware details, resulting in a cleaner model. We prove that our two operational characterisations of RDMA<sup>TSO</sup> are equivalent and mechanise our proof in Coq. We then present a declarative model of RDMA<sup>TSO</sup> and show that it is equivalent to our simplified (and thus also concrete) operational model.

We have developed  $R_{\rm DMA}^{TSO}$  in close collaboration with engineers at NVIDIA, the largest manufacturer of networking products worldwide (after acquiring Mellanox in 2019). In particular, we have discussed all weak behaviours admitted by RDMA<sup>TSO</sup> with the engineers and have reflected the hardware justification for such behaviours in our concrete semantics. To further increase confidence in the fidelity of RDMA<sup>TSO</sup> to the specification, we have *empirically validated* it via extensive testing on existing implementations. More specifically, through our empirical validation we have managed to establish that (1)  $R_{\text{DMA}}^{\text{TSO}}$  is not too strong: we did not observe any of the behaviours prohibited by RDMA<sup>TSO</sup> on existing implementations; and (2) RDMA<sup>TSO</sup> is not *too weak*: we managed to observe almost all weak behaviours allowed by  $R_{\text{DMA}}^{\text{TSO}}$ , on existing implementations, and in the few cases where we did not observe a weak behaviour allowed by  $R_{\rm{DMA}}^{\rm{TSO}}$ , the engineers at NVIDIA

<span id="page-3-1"></span>

Fig. 1. TSO litmus tests for CPU concurrency, where locations  $x, y$  are accessed by all threads, while locations a, b are accessed by one thread only, and  $x = y = 0$  on the first line denotes that x, y initially hold value 0.

confirmed that current implementations explicitly do not utilise the weakness admitted by the RDMA specification (see [§5\)](#page-20-0).

**Contributions and Outline**. In [§2](#page-3-0) we present an intuitive account of RDMA<sup>TSO</sup> through a number of examples. In  $\S$ 3 we present our concrete and simplified operational semantics of RDMA<sup>TSO</sup> and show that they are equivalent. In  $\S 4$  we present our declarative semantics of  $RDMA^{TSO}$  and show that it is equivalent to our simplified operational semantics. In [§5](#page-20-0) we describe how we empirically validated RDMA<sup>TSO</sup> through extensive litmus testing. We discuss related and future work in [§6.](#page-22-0)

Additional Material. The full proofs of all stated theorems are given in the extended version [\[Ambal et al.](#page-24-8) [2024a\]](#page-24-8) and accompanying Coq development [\[Ambal et al.](#page-24-9) [2024b\]](#page-24-9). We provide the executable RDMA code (in machine-readable format) and detailed instructions for replicating our experiments and analysing our litmus tests [\[Ambal et al.](#page-24-9) [2024b\]](#page-24-9).

### <span id="page-3-0"></span>2 Overview

We present an account of the formal RDMA semantics,  $R<sub>DMA</sub><sup>TSO</sup>$ , through a number of examples. We model concurrent programs running over a network of machines, and hereafter refer to each machine on the network as a *node*. In our setting, the semantics of a concurrent program and thus its possible weak behaviours are determined by two factors: (1) the origin of the threads, i.e. whether all threads originate from (are forked by) the same node and thus the concurrency is intra-node (within one node), or they originate from several nodes and thus concurrency is inter-node (across two or mode nodes); and (2) the memory targeted by the threads, i.e. whether each thread accesses its own local memory (on the same node), that of other nodes, or a combination thereof.

Litmus Test Outcome Notation. In the remainder of this article, we present small representative examples (known as litmus tests in the literature) to illustrate whether an outcome is allowed by a given model (e.g. in Fig. [1,](#page-3-1) Fig. [2](#page-4-0) and Fig. [3\)](#page-6-0), and annotate a given outcomes with: (1)  $\checkmark$ , to denote that the outcome is *allowed* by the model *and observed* in practice (in our empirical validation); (2)  $\checkmark$ , to denote that the outcome is allowed by the model and not observable in practice; (3)  $\chi$ , to denote that the outcome is *disallowed* by the model *and not observed* in practice. See [§5](#page-20-0) or the extended version for more details.

CPU Concurrency and TSO. Existing work on RDMA semantics [\[Dan et al.](#page-25-8) [2016\]](#page-25-8) assumes that CPU concurrency on each node is governed by the strong and unrealistic sequential consistency (SC) model [\[Lamport](#page-25-11) [1979\]](#page-25-11). We relax this assumption here and instead model each node as an x86 machine, subject to the TSO memory model [\[Sewell et al.](#page-26-11) [2010\]](#page-26-11) introduced by the SPARC architecture [\[SPARC](#page-26-12) [1992\]](#page-26-12). Under TSO, a later read (in program order) can be reordered before an earlier write on a different location. This is illustrated in the store buffering example of Fig. [1a,](#page-3-1) where x and  $y$  denote locations accessed by both threads, while  $a$  and  $b$  denote locations accessed by

<span id="page-4-0"></span>

Fig. 2. Sequential RDMA litmus tests (excerpt), where each column (separated by ||) denotes a distinct node, the statement on the top line of each column denotes the initial values of locations, and the statements in the caption express whether each outcome is allowed by RDMA<sup>TSO</sup> and observed in practice (✓), or disallowed by  $RDMA^{TSO}$  and not observed in practice  $(X)$ ; i.e. we have empirically validated all outcomes shown.

one thread only.<sup>[1](#page-4-1)</sup> Specifically, the reads  $a := y$  and  $b := x$  can respectively be reordered before the writes  $x := 1$  and  $y := 1$ , allowing them to read the initial value 0 from y and x, yielding  $a = b = 0$ at the end of execution. Note that this weak behaviour is not allowed under the stronger SC model as SC admits no instruction reordering.

To prevent such write-read reordering, one can use an mfence as in Fig. [1b:](#page-3-1) mfence instructions cannot be reordered in either direction, and thus the weak behaviour shown is no longer possible. Indeed, other than write-read reordering, TSO admits no other reorderings and thus other weak behaviours, e.g. load buffering in Fig. [1c](#page-3-1) and message passing in Fig. [1d](#page-3-1) are prohibited under TSO.

Remote Direct Memory Access (RDMA). RDMA allows one to build a network of communicating nodes (machines), where each node can directly access remote memory (of other nodes) through its network interface card (NIC). RDMA networks are programmed via operations that read from and write to remote memory, as well as various synchronisation operations. As such, programming RDMA networks is conceptually similar to shared memory systems such as TSO.

To distinguish remote (RDMA) operations from CPU ones, we refer to RDMA reads and writes as get and *put* operations, respectively. Moreover, to distinguish local and remote memory locations, we write  $x^n$  for a memory location on a remote node *n*, and write x for a memory location on the current local node. A put operation is of the form  $x^n := y$ , and consists of reading from a local memory location  $y$  (referred to as a 'NIC local read') and writing to a remote memory location  $x$ on node  $n$  (a 'NIC remote write' or simply a 'remote write'). Similarly, a get operation is of the form  $x := y^n$ , and consists of reading from a remote memory location y on node n (a 'NIC remote read' or a 'remote read') and writing to a local memory location  $x$  (a 'NIC local write'). We write  $\overline{n}$  to identify a node other than *n*. When a thread on local node *n* issues a remote operation to be executed on remote node  $\overline{n}$ , we denote this by stating that the operation is by *n* towards  $\overline{n}$ .

Sequential (Single-Threaded)  $R_{\text{IDMA}}^{\text{TSO}}$  Behaviours. When a thread issues a put or get operation, it is handled by the NIC subsystem (and its associated queue pairs and buffers as shown

<span id="page-4-1"></span><sup>&</sup>lt;sup>1</sup>In our general model, all memory locations are shared and thus can be accessed by all threads both locally (on the same node) and remotely. However, for better readability, we follow the convention of naming locations accessed by multiple threads (locally or remotely) as  $x$ ,  $y$ ,  $z$  and  $w$ , while naming locations accessed by a single local thread as  $a$ ,  $b$ ,  $c$  and  $d$ .

in Fig. [4\)](#page-9-0), in contrast to the CPU operations which are handled by the processor subsystem (and its associated store buffers). As such, the interaction between CPU and remote operations lead to further behaviours even within a sequential (single-threaded) program. We demonstrate this in the examples of Fig. [2,](#page-4-0) where each column represents a distinct node, numbered from 1 onwards. For instance, the example in Fig. [2a](#page-4-0) comprises a single thread on node 1 (the left-most column) that writes to the local location  $x$  ( $x := 1$ ) and puts  $x$  towards the remote location  $z$  on node  $2$  ( $z^2 := x$ ).

Intuitively, when a thread t on *n* issues remote operations towards node  $\overline{n}$ , one can view these remote operations as if being executed by a thread running in parallel to t. As such, when a remote operation follows a CPU one, the order of the two operations is preserved since the parallel thread is spawned only after the CPU operation is executed. This is illustrated in Fig. [2a:](#page-4-0) as  $\overline{z}^2 \coloneqq x$  follows  $x := 1$ , it observes the  $x := 1$  write and thus puts value 1 to z; i.e. outcome  $z = 0$  is not permitted.

By contrast, when a remote operation precedes a CPU one, the remote operation is performed by a 'separate thread' run in parallel to the later CPU operation in the main thread, and thus may execute before or after the CPU operation, meaning that in the latter case the execution order is not preserved. This is illustrated in Fig. [2b,](#page-4-0) where the earlier  $z^2 := x$  may execute (be reordered) after the later  $x := 1$ , and thus both  $z = 0$  and  $z = 1$  outcomes are possible.

Therefore, before using the result of a get or reusing the memory location of a put, it may be desirable to avoid such reordering and to wait for the remote operation to complete. This can be done through a CPU poll operation,  $pol(n)$ , that blocks until the *earliest* (in program order) remote operation towards node *n* has completed.<sup>[2](#page-5-0)</sup> This is shown in Fig. [2c,](#page-4-0) obtained from Fig. [2b](#page-4-0) by inserting a poll after the remote operation:  $\text{pol}(2)$  waits for  $z^2 := x$  to complete before proceeding with  $x := 1$ , and thus  $z^2 := x$  can no longer be reordered after  $x := 1$ , prohibiting the  $z = 1$  outcome.

Note that each poll $(n)$  waits for *only one* (the earliest in program order) and *not all* pending remote operation towards  $n$  to complete. This allows for more efficient and fine-grained control over remote operations, but requires some care. For instance, consider the example in Fig. [2d,](#page-4-0) where poll(2) blocks until the first  $z^2 := x$  is complete, and thus the second  $z^2 := x$  operation can be reordered after the later  $x = 1$ , once again allowing for the  $z = 1$  outcome.

Two remote operations towards different nodes are fully independent and can execute in either order (as if within two separate threads). For instance, the two get operations in Fig. [2e](#page-4-0) can execute in either order, and thus the final value of x may be either that of  $z$  ( $x = 1$  outcome) or that of y (the  $x = 2$  outcome). Similarly, Fig. [2f](#page-4-0) has two possible outcomes. The only way to enforce the execution order is polling the first remote operation before running the second.

The ordering guarantees on remote operations towards the *same* node are stronger and only certain reorderings are allowed. Recall that a remote (NIC) put operation  $x^n := y$  comprises two steps: a NIC local read (obtaining the value of  $y$ ) and a NIC remote write (writing the value of  $y$  to  $x^n$ ). Similarly, a remote get operation  $x := y^n$  comprises two steps: a NIC remote read (obtaining the value of  $y^n$ ) and a NIC local write (writing the value of  $y^n$  to x). Intuitively, for remote operations on a given location  $x$ , these four steps mandate a *precedence* which in turn determines whether they can be reordered. Specifically, the four steps described above give way to the following precedence order: i) NIC local read; ii) NIC remote write; iii) NIC remote read; iv) NIC local write.

If a step with a higher precedence (e.g. a NIC local read) is in program order before one with a lower precedence (e.g. a NIC local write), then their order is preserved and they cannot be reordered; otherwise the order is not necessarily preserved and these steps can be reordered. For instance, in the Fig. [2g](#page-4-0) example, the earlier NIC local read on x (in  $z^2 := x$ ) has a higher precedence than the

<span id="page-5-0"></span> $^2$ In the Verbs API, pol $1(n)$  returns a value denoting whether the operation being polled has completed and does not block the execution. However, it is common practice for calls to poll to be placed in a spin loop that returns only when the operation is completed, hence blocking the execution. Here, we model this common pattern with stronger behaviours.

<span id="page-6-0"></span>

| $y=0$                      | $x=0$ |                                                                                                                                                                                                             | $x=0$    $y=0$ | $x=0$    $y=0$                                                                                                                                                                                                                                                                                                                                                                           |                   | $y=0$    $x=0$                       |                                                                                     | $y=w=0$   $x=z=0$ |  |
|----------------------------|-------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|--------------------------------------|-------------------------------------------------------------------------------------|-------------------|--|
|                            |       | $\begin{vmatrix} x^2 := 1 \\ a := y \end{vmatrix}$ $\begin{vmatrix} y^1 := 1 \\ b := x \end{vmatrix}$ $\begin{vmatrix} a := y^2 \\ x := 1 \end{vmatrix}$ $\begin{vmatrix} b := x^1 \\ y := 1 \end{vmatrix}$ |                | $\begin{vmatrix} a := y^2 \\ \text{pol1}(2) \end{vmatrix}$ $\begin{vmatrix} b := x^1 \\ \text{pol1}(1) \end{vmatrix}$ $\begin{vmatrix} x^2 := 1 \\ \text{pol1}(2) \end{vmatrix}$ $\begin{vmatrix} y^1 := 1 \\ \text{pol1}(1) \end{vmatrix}$ $\begin{vmatrix} c := z^2 \\ \text{pol1}(2) \end{vmatrix}$ $\begin{vmatrix} d := w^1 \\ \text{pol1}(1) \end{vmatrix}$<br>$x := 1$ $  y := 1$ | $ a:=y $ $ b:=x $ |                                      | $x^2 := 1$ $  y^1 := 1$<br>$ {\rm poll}(2)   {\rm poll}(1) $<br>$a := y$ $  b := x$ |                   |  |
| (a) $a = b = 0 \checkmark$ |       | (b) $a = b = 1$                                                                                                                                                                                             |                | (c) $a = b = 1 X$                                                                                                                                                                                                                                                                                                                                                                        |                   | (d) $a = b = 0 \blacktriangledown^*$ |                                                                                     | (e) $a = b = 0 X$ |  |

Fig. 3. Concurrent RDMA litmus tests (excerpt). The annotations in the captions indicate the given outcome is allowed by  $R_{\text{DMA}}^{\text{TSO}}$  and observed in our empirical validation ( $\checkmark$ ), allowed by  $R_{\text{DMA}}^{\text{TSO}}$  and not observable in practice ( $\checkmark$ \*), or disallowed by RDMA<sup>TSO</sup> and not observed in our validation ( $\checkmark$ ). That is, we have empirically validated all outcomes shown, except that in [\(d\)](#page-6-0), which is due to the underlying implementation explicitly not utilising the weakness admitted by the specification – see [§5](#page-20-0) or the extended version for more details.

later NIC local write on  $x$  (in  $x := y^2$ ), and thus the order of these steps on  $x$  is preserved; i.e. the old value (1) of x is written to  $z^2$  leading to the  $z = 1$  outcome, and the  $z = 0$  outcome is disallowed.

By contrast, in the Fig. [2h](#page-4-0) example, the earlier NIC local write on  $x$  (in  $x := y^2$ ) has a lower precedence than the later NIC local read on x (in  $z^2 := x$ ) and thus the two steps can be reordered. Besides the SC outcome ( $z = 0$ ), the program might execute the NIC local read on x before the NIC local write on x, thereby reading the initial value 1 from x and writing it to z (outcome  $z = 1$ ).

As before, the reordering of the two remote operations in Fig. [2h](#page-4-0) can be prevented by polling the first operation before the second operation. However, polling is costly as it leads to global blocking: it blocks the current thread both on the CPU and the NIC towards all nodes (i.e. the current thread cannot execute any remote operations on  $any$  node). Alternatively, we can use a remote fence<sup>[3](#page-6-1)</sup> operation, rfence  $(n)$ , that blocks the current thread only on the NIC and only towards node  $n$ . (i.e. the thread cannot execute any remote operations towards  $n$ , but can execute both on the CPU as well as on the NIC towards nodes other than  $n$ ). This in turn ensures that earlier (in program order) remote operations by the thread towards  $n$  (those before the fence) are executed before later remote operations towards  $n$  (those after the fence). This is illustrated in Fig. [2i,](#page-4-0) obtained from Fig. [2h](#page-4-0) by inserting rfence (2) between the two remote operations towards node 2, thereby ensuring that they are executed in order, and thus  $z = 1$  is no longer possible.

<span id="page-6-2"></span>Concurrent (Multi-threaded) RDMA<sup>TSO</sup> Behaviours. The real power of RDMA comes from multiple programs running on different nodes. This introduces a wide range of weak behaviours, as we describe below. A network can comprise several nodes, each running several concurrent threads. Here, we focus on a few examples each with two nodes, with each node comprising a single thread.

The store buffering behaviour due to the TSO model discussed in Fig. [1a](#page-3-1) is also possible in the RDMA setting, i.e. when locations x and  $y$  are on two different nodes, as shown in Fig. [3a.](#page-6-0) Moreover, further weak behaviours not possible under TSO, e.g. load buffering in Fig. [1c,](#page-3-1) are permitted in the RDMA setting, as shown in Fig. [3b.](#page-6-0) As before, this can be intuitively justified by conceptually viewing the remote operations in each thread as being executed by a separately spawned thread.

Most weak behaviours such as load buffering in Fig. [3b](#page-6-0) can be prevented by polling the remote operations as needed, as shown in Fig. [3c.](#page-6-0) Specifically, the poll operations in Fig. [3c](#page-6-0) await the completion of the preceding get operations, and thus the earlier get operations cannot be reordered after the later writes, thus prohibiting the weak behaviour  $a = b = 1$ . However, a notable exception to this is the store buffering weak behaviour which cannot be prevented even when polling the

<span id="page-6-1"></span> $3$ For InfiniBand Verbs, this remote fence is not an independent operation but a flag that can be set on the later operation.

remote operations, as shown in Fig. [3d.](#page-6-0) This is because the specification of polling offers different guarantees for get and put operations. Specifically, polling a get operation  $a := x^2$  offers a strong guarantee and behaves intuitively: polling ensures that the get operation is complete (i.e. the value of  $x^2$  is fetched from node 2), and the executing thread performs the associated NIC local write on a before marking the operation as complete and proceeding with the remainder of the execution. By contrast, polling a put operation  $x^2 := 1$  offers a weaker guarantee: when sending value 1 towards node 2 to be put in  $x^2$ , the remote NIC merely *acknowledges* having received the data (value 1), but this data may still reside in a buffer (i.e. the PCIe fabric) of the remote node, pending to be written to the remote memory. Polling a put operation then simply awaits the acknowledgement of the data receipt and not its full completion (the data being written to memory). As such, it is possible to poll a put operation successfully before the associated remote write has fully completed. In the case of store buffering in Fig. [3d,](#page-6-0) it is possible for both poll operations to complete before the values of x and y are updated (to 1) in memory, and thus for the later reads to read their old values (0). However, note that existing implementations on current hardware (including ones against which we validated our model,) do not utilise this flexibility admitted by the specification. As such, the weak behaviour in Fig. [3d](#page-6-0) is not observable in practice, and thus we could not observe them in our validation effort (as indicated by  $\checkmark$ ) – see [§5.](#page-20-0) Nevertheless, since our aim is to capture the specification and not the *implementation*, we have modelled  $R_{\text{DMA}}^{\text{TSO}}$  to allow this weak behaviour.

The behaviours discussed thus far all hold of the general RDMA specification [\[IBTA](#page-25-2) [2022\]](#page-25-2) as well as the PCIe standard [\[PCI-SIG](#page-25-13) [2022\]](#page-25-13). However, in certain cases the PCIe standard offers stronger guarantees than those delineated by the RDMA specification. In particular, PCIe does not allow a read to fetch a pending value that has not yet been committed to memory. As such, a NIC remote read flushes (commits) all pending NIC remote writes to memory, while a NIC local read flushes all pending NIC local writes to memory. Interestingly, we can use this guarantee to prevent weak behaviours such as store buffering (which, in theory, cannot be prevented even via polling). Specifically, recall that polling a put only ensures that the data transmitted has reached the remote node and may not have yet been committed to its memory. However, by polling a later get (towards the same remote node) we can ensure the previous NIC remote writes (including that of the recently polled put) have been committed to the remote memory. An example of this is shown in Fig. [3e,](#page-6-0) obtained from Fig. [3d](#page-6-0) by adding additional gets ( $c:=z^2$  and  $d:=w^1\overline{)}$  and subsequently polling them. Reading from  $z^2$  and  $w^2$  in effect flushes the pending NIC remote writes on both nodes, ensuring that the effects of the earlier puts ( $x^2 := 1$  and  $y^1 := 1$ ) are committed to memory, which in turn ensures that the later  $a := y$  and  $b := x$  reads observe the updates on  $y$  and  $x$  (value 1), thus prohibiting  $a = b = 0$ .

As mentioned above, this guarantee is PCIe-specific, and not mentioned by the RDMA standard. However, as PCIe is the de facto standard for RDMA programming, and since all widely available RDMA hardware is PCIe-compatible, here we opt to model this guarantee, resulting in a stronger model. Nevertheless, in [§3](#page-7-0) and [§4](#page-15-0) we also describe how we can weaken our models by removing this guarantee, both in our operational and declarative semantics.

Our aim here was to provide an overview of the weak RDMA behaviours both in the sequential and concurrent settings. We refer the reader to the extended version [\[Ambal et al.](#page-24-8) [2024a\]](#page-24-8) for further examples of weak behaviours.

### <span id="page-7-0"></span>3 RDMA<sup>TSO</sup> Concrete and Simplified Operational Semantics

We begin with several preliminary concepts. We present an informal account of our concrete semantics [\(§3.1\)](#page-8-0), our formal concrete semantics [\(§3.2\)](#page-11-0) and our equivalent simplified semantics [\(§3.3\)](#page-15-1).

Nodes and Threads. We consider a system with  $N$  nodes (machines) and  $M$  threads in total across all machines. Let Node =  $\{1, \ldots, N\}$  be the set of *node identifiers*, and Tid =  $\{1, \ldots, M\}$  be the set of *thread identifiers*. We use  $n$  and  $t$  to range over Node and Tid, respectively. Given a node n, we write  $\overline{n}$  to range over Node\  $\{n\}$ . Each thread  $t \in T$ id is associated with a node, written  $n(t)$ . Note that multiple threads may run on the same node.

Memory Locations. Each node  $n$  has a set of locations,  $Loc_n$ , accessible by all nodes. We define Loc  $\triangleq \biguplus_n \text{Loc}_n$  and  $\text{Loc}_{\overline{n}} \triangleq \text{Loc} \setminus \text{Loc}_n$ . We use  $x^n, y^n, z^n$  and  $x^{\overline{n}}, y^{\overline{n}}, z^{\overline{n}}$  to range over  $\text{Loc}_n$  and  $\text{Loc}_{\overline{n}}$ , respectively. When the choice of *n* is clear, we write *x* for  $x^n$ ; similarly for  $\overline{n}$ . For clarity, we use distinct location names across nodes, and thus write  $n(x)$  for the unique  $n \in N$ ode where  $x \in \text{Loc}_n$ .

**Values and Expressions**. We assume a set of values, Val, with  $\mathbb{N} \subseteq$  Val, and use v to range over Val. We assume a language of expressions over Val and Loc, and elide its exact syntax and semantics (as these are standard). We denote by Exp the set of all expressions and use  $e$  to range over Exp. We write  $\llbracket e \rrbracket$  for the semantic evaluation of a *closed* expression  $e$  (i.e. one without any locations), elocs(e) for the locations in an expression e, and  $\epsilon[\nu/x]$  for the expression obtained from e after substituting all occurrences of x by v. We use  $e^n$  to range over expressions with elocs( $e^n$ )  $\subseteq$  Loc<sub>n</sub>.

Sequential Commands and Programs. Sequential programs running on node  $n$  are described by the C<sup>n</sup> grammar below and include primitive commands (c<sup>n</sup>), sequential composition (C<sup>n</sup>  $\binom{n}{1}$ ; C<sub>2</sub>  $\binom{n}{2}$ , non-deterministic choice  $(C_1^n)$  $n_1^n + C_2^n$  $n_2^n$ , executing either  $C_1^n$  $\binom{n}{1}$  or  $\binom{n}{2}$  $\binom{n}{2}$  and non-deterministic loops ( $\binom{n}{k}$ , executing  $C^n$  for a finite, possibly zero, number of iterations). A (concurrent) program, P, is a map from thread identifiers to commands, associating each thread  $t \in T$ id with a command on node  $n(t)$ .

Comm ∋  $C^n$  ::= skip |  $c^n$  |  $C_1^n$  $\binom{n}{1}$ ; C<sub>2</sub>  $\binom{n}{2}$  | C<sub>1</sub>  $n_1^n + C_2^n$  $\binom{n}{2}$  | C  $n^*$  PComm  $\Rightarrow$  c<sup>n</sup> ::= cc<sup>n</sup> | rc<sup>n</sup>  $CComm \ni cc^n ::= x := e^n \mid \text{assume}(x = v) \mid \text{assume}(x \neq v) \mid \text{mfence} \mid x := CAS(y, e_1, e_2) \mid \text{poll}(\overline{n})$ RComm∋rc $^n ::= x := \overline{y} | \overline{y} := x |$ rfence  $(\overline{n})$ 

Primitive commands include  $CPU$  (cc<sup>n</sup>) and  $RDMA$  (rc<sup>n</sup>) operations. A CPU operation on n may be a no-op (skip), an assignment to a local location ( $x := e$ ), an assumption on the value of a local location (assume( $x = v$ ) and assume( $x \neq v$ )), a memory fence (mfence), an atomic CAS ('compare-and-set') operation  $(x := CAS(y, e_1, e_2))$ , or a 'poll', poll $(\overline{n})$ , that awaits the completion notification of the earliest put/get that is pending (not yet acknowledged). An RDMA operation may be (i) a 'get',  $x := \overline{y}$ , reading from remote location  $\overline{y}$  and writing the result to local location x; (ii) a 'put',  $\overline{y} := x$ , reading from local location x and writing the result to remote location  $\overline{y}$ ; or (iii) an 'RDMA fence', r fence  $(\bar{n})$ , which ensures that all later (in program order) RDMA operations towards  $\overline{n}$  will await the completion of all earlier RDMA operations towards  $\overline{n}$ . Note that poll  $(\overline{n})$ is executed by the CPU and blocks its thread (and prevents the requests of later remote operations), while rfence  $(\overline{n})$  blocks the NIC for the execution of remote operations towards node  $\overline{n}$ . In what follows, we write  $\bar{x} := 1$  as a shorthand for  $\bar{x} := a$  for some local location a containing value 1.

### <span id="page-8-0"></span>3.1 RDMA<sup>TSO</sup> Concrete Operational Semantics at a Glance

**RDMA<sup>TSO</sup>** Architecture and CPU Operations. Conceptually, the RDMA<sup>TSO</sup> architecture is as shown at the top-right of Fig. [4,](#page-9-0) where for brevity we depict two nodes, each comprising  $(1)$  m threads and their associated FIFO (first-in-first-out) store buffers; (2) a network interface card (NIC) and its PCIe root complex; and (3) the memory. Store buffers are used to model write-read reordering on TSO, which account for the weak behaviour in Fig. [1a.](#page-3-1) Specifically, executing a write on TSO comprises two steps: (i) when a thread issues a write, the write is only recorded in its store buffer; (ii) writes in the buffer are debuffered (in FIFO order) and propagated to the local memory at a later point. When a thread issues a read from a location  $x$ , it first consults its own store buffer. If it

<span id="page-9-0"></span>

Fig. 4. RDMA<sup>TSO</sup> architecture overview (top, right); a possible RDMA network configuration with three nodes and four thread (left); the queue-pair structure (below, right)

contains a write for  $x$ , the thread reads the value of the latest such write; otherwise, the thread reads the value of x from its local memory. In other words, one can model the reordering of a write  $w$ after a later read  $r$  by *delaying* the debuffering of  $w$  until after  $r$  has executed. Moreover, executing an mfence or a CAS debuffers all its delayed writes in the store buffer and propagates them to memory (in FIFO order), thus preventing write-read reordering. That is, the only CPU operations that a store buffer contains are delayed writes. We describe the execution of polls shortly.

Remote Operations and Queue Pairs. To model the communication between the nodes, each thread  $t$  has a distinct *queue pair* for each remote node whose memory  $t$  accesses. For instance, the network configuration in the left of Fig. [4](#page-9-0) comprises three nodes with four threads and with the queue pairs depicted as 'horse-shoes', where e.g. thread T2 in the bottom left node accesses the memories of the other two nodes, and it does so through two distinct queue pairs shown underneath T2. We refer to the queue pair of a thread towards node  $\overline{n}$  as its  $\overline{n}$ -queue pair.

The details of a queue-pair structure is shown at the bottom-right of Fig. [4,](#page-9-0) where each queue pair of thread t is connected to the store buffer of t. As shown, an  $\bar{n}$ -queue pair comprises six FIFO buffers—req<sub>L</sub>, in<sub>R</sub>, wb<sub>R</sub>, out<sub>R</sub>, rsp<sub>L</sub>, and wb<sub>L</sub>—that we describe below. A queue pair contains pending operations if any of its  $in_R$ , out<sub>R</sub> or  $rsp_L$  components is non-empty. The types of the entries in each of the six buffers of a queue pair is also summarised in Fig. [5.](#page-10-0)

Recall that a remote operation  $rc<sup>n</sup>$  may either be a get, put or remote fence instruction. When t executes a remote operation rc<sup>n</sup> towards node  $\overline{n}$ , it adds rc<sup>n</sup> to its store buffer. That is, as well as CPU (delayed) writes, a store buffer may contain remote (get, put and fence) operations. Once  $rc^n$ is debuffered from the store buffer, it is forwarded to its  $\overline{n}$ -queue pair, where it travels through the queue-pair pipeline and is processed differently depending on the type of rc<sup>n</sup>, as we describe below.

**Request Buffers (req<sub>L</sub>**). The req<sub>L</sub> is the entry point of the queue pair, containing remote requests that are to be forwarded. It comprises a sequence of remote get, put and fence operations to be handled by the local NIC. When an entry  $r c^n$  reaches the head of req<sub>L</sub>, it is processed as follows. (1) If  $rc^n$  is a get, then it is simply forwarded to the remote inbox  $in_R$ . (2) If  $rc^n$  is a put  $x^{\overline{n}} := y$ , then the value  $v$  of  $y$  is looked up in the local memory and  $x^{\overline{n}} := v$  is forwarded to  $\text{in}_R$ ; i.e.  $y$  in  $x^{\overline{n}} := y$  is replaced with its current in-memory value. (3) If  $rc<sup>n</sup>$  is a remote fence, then the execution on the queue pair is blocked until it has no pending operations, i.e.  $\text{in}_R$ , out<sub>R</sub> and rsp<sub>L</sub> are all empty.

<span id="page-10-0"></span>

|                  | $y^n := x^n$ |  | $y^{\overline{n}} := v$ ack <sub>p</sub> $x^n := y^{\overline{n}}$ $x^n := v$ cn |  | rfence $\overline{n}$ |
|------------------|--------------|--|----------------------------------------------------------------------------------|--|-----------------------|
| $b \in SBuff$    |              |  |                                                                                  |  |                       |
| reqL             |              |  |                                                                                  |  |                       |
| in <sub>R</sub>  |              |  |                                                                                  |  |                       |
| $w_{R}$          |              |  |                                                                                  |  |                       |
| $_{\rm out_R}$   |              |  |                                                                                  |  |                       |
| rsp <sub>L</sub> |              |  |                                                                                  |  |                       |
| wb <sub>L</sub>  |              |  |                                                                                  |  |                       |

Fig. 5. The types of entries in the store buffers (b) of a thread on *n* and its six buffers on the  $\overline{n}$ -queue pair

**Inbox Buffers (in<sub>R</sub>)**. The in<sub>R</sub> contains requests forwarded by req<sub>L</sub> that are to be processed by  $\overline{n}$ , i.e. each req<sub>L</sub> entry is either a get or a put with a value (of the form  $y^{\overline{n}} := v$ ). Processing a put  $y^{\overline{n}} := v$  (once at in<sub>R</sub> head) forwards it to the remote write-back buffer wb<sub>R</sub> and also sends a *put* acknowledgement,  $ack_p$ , to  $out_R$ . Processing a get  $x := y^{\overline{n}}$  (once at  $in_R$  head) does not immediately fetch the *y* value from the  $\overline{n}$  memory; rather, it forwards it to **out<sub>R</sub>** to be fulfilled later in **out**<sub>R</sub>.

Outbox Buffers (out<sub>R</sub>). The out<sub>R</sub> buffer contains requests processed or forwarded by in<sub>R</sub>, and thus each entry in out<sub>R</sub> is either a put acknowledgement (ack<sub>p</sub>), a get operation  $x := y^{\frac{1}{n}}$  *yet to be* fulfilled, or a fulfilled get  $x := v$ . An acknowledgement or a fulfilled get is processed when it reaches the head of out<sub>R</sub>, whereupon it is simply forwarded to  $rsp<sub>L</sub>$ . By contrast, a yet-to-be-fulfilled get  $x := y^{\overline{n}}$  may be fulfilled at *any time* (before reaching the **out**<sub>R</sub> head), where the value  $v$  of  $y$  is fetched from the memory of  $\overline{n}$ , and  $x := y^{\overline{n}}$  is transformed to the fulfilled get  $x := v$  and left in out<sub>R</sub>. This fulfilled get is later processed once it reaches the head of  $out<sub>R</sub>$ , as described above.

**Remote Write-Back Buffers (wbR)**. The wb<sub>R</sub> buffer contains requests forwarded by  $in<sub>R</sub>$ , and thus each entry in  $\bf{w} \bf{b}_R$  is a put operation with a value (of the form  $y^{\frac{1}{n}} := v$ ). Processing  $y^{\frac{1}{n}} := v$  (at the head of wb<sub>R</sub>) simply removes it from wb<sub>R</sub> and writes  $v$  to  $y$  in the memory of  $\overline{n}$ .

Response Buffers ( $resp<sub>L</sub>$ ). The  $resp<sub>L</sub>$  buffer contains processed requests forwarded by out<sub>R</sub>, and thus each entry in  $rsp<sub>L</sub>$  is either an  $ack<sub>p</sub>$ , acknowledging a processed put, or a fulfilled get. Processing each rsp<sub>L</sub> entry (once at the head of rsp<sub>L</sub>) generates a *completion notification*, cn, which is used to serve poll requests, as we describe shortly. Specifically, processing  $ack_{D}$  simply removes it from  $\text{rsp}_L$  and forwards a completion notification to  $\text{wb}_L$ . Analogously, processing a fulfilled get  $x := v$  simply forwards  $x := v$  together with a completion notification to  $\mathbf{wb}_L$ .

**Local Write-Back Buffers (wb**<sub>L</sub>). The  $wb$ <sub>L</sub> buffer contains processed requests forwarded by  $rsp_L$ ; i.e. each entry in  $wb_L$  is either a completion notification (cn, associated with processed gets and puts), or a fulfilled get (of the form  $x := v$ ). Processing a fulfilled get  $x := v$  simply removes it and writes  $v$  to  $x$  in the local memory. A completion notification is left in the wb<sub>L</sub> and is only removed when the associated get/put operation is polled, as we describe below.<sup>[4](#page-10-1)</sup>

<span id="page-10-2"></span>**Poll Operations**. Once a get enters req<sub>L</sub>, it progresses through the pipeline as follows.  $(G_1)$  it is forwarded to  $\mathbf{in}_{\mathbb{R}}$ ;  $(G_2)$  it is forwarded to out<sub>R</sub> without being fulfilled;  $(G_3)$  it is fulfilled at some point while in out<sub>R</sub>; (G<sub>4</sub>) it is forwarded to rsp<sub>L</sub>; (G<sub>5</sub>) it is forwarded to wb<sub>L</sub> together with a completion notification (cn); and  $(G_6)$  it is removed from wb<sub>L</sub> and its effect is written to memory.

<span id="page-10-4"></span><span id="page-10-3"></span>Similarly, once a put operation  $x^{\overline{n}} := y$  enters the queue-pair pipeline it proceeds as follows. (P<sub>1</sub>) it is simplified to  $x^{\overline{n}} := v$  (where v is the value of y in the local memory) and forwarded to in<sub>R</sub>;  $(P_2)$  it is forwarded to wb<sub>R</sub> and simultaneously an acknowledgement ack<sub>p</sub> is forwarded to out<sub>R</sub>;

<span id="page-10-1"></span> $^4$ In some implementations, write-back buffers ( $\bf{w}b_L$  and  $\bf{w}b_R$ ) of different queue pairs may physically use the same hardware buffers in the PCIe fabric. This does not introduce any additional weak behaviours.

<span id="page-11-1"></span>

|                                                                                                                                                  | <b>Program transitions:</b> Prog $\xrightarrow{\text{Tid:Lab}\Downarrow{\{\varepsilon\}}}$ Prog |                                                                                                                                                                                                                                                                                                                                                    | <b>Command transitions:</b> Comm $\xrightarrow{\text{Lab}\uplus{\{\varepsilon\}}}$                         | → Comm                                                                                                         |
|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| Lab $\triangleq \bigcup_n$ Lab <sub>n</sub>                                                                                                      |                                                                                                 | $l \in \text{Lab}_n \triangleq \begin{cases} 1 \mathbb{W}(x^n, v), 1 \mathbb{R}(x^n, v), \text{CASS}(x^n, v_1, v_2), \text{CASF}(x^n, v), & x, y \in \text{Loc}, \\ \mathbb{F}, \mathbb{P}(\overline{n}), \text{Get}(x^n, y^{\overline{n}}), \text{Put}(y^{\overline{n}}, x^n), \mathbb{F}(\overline{n}) & v, v_1, v_2 \in \text{Val} \end{cases}$ |                                                                                                            |                                                                                                                |
| $C_1 \xrightarrow{l} C_1'$                                                                                                                       |                                                                                                 | $i \in \{1, 2\}$                                                                                                                                                                                                                                                                                                                                   |                                                                                                            |                                                                                                                |
| $\overline{C_1; C_2 \stackrel{l}{\rightarrow} C_1'; C_2}$                                                                                        |                                                                                                 | skip; $C \xrightarrow{\varepsilon} C$ $\overline{C_1 + C_2 \xrightarrow{\varepsilon} C_1^n}$ $\overline{C^* \xrightarrow{\varepsilon}$ skip $\overline{C^* \xrightarrow{\varepsilon} C_1 C^*}$                                                                                                                                                     |                                                                                                            |                                                                                                                |
| $\mathrm{C}\leadsto\mathrm{C}'$                                                                                                                  | $\text{elocs}(e) = \emptyset$                                                                   |                                                                                                                                                                                                                                                                                                                                                    | $\text{elocs}(e_{\text{old}}) = \text{elocs}(e_{\text{new}}) = \emptyset \qquad v \neq [[e_{\text{old}}]]$ |                                                                                                                |
| $\overline{C \xrightarrow{\varepsilon} C'}$                                                                                                      | $x := e \xrightarrow{\exists W(x, [[e]])}$ skip                                                 | $z := \text{CAS}(x, e_{\text{old}}, e_{\text{new}}) \xrightarrow{\text{CASF}(x,v)} z := v$                                                                                                                                                                                                                                                         |                                                                                                            |                                                                                                                |
|                                                                                                                                                  | $\text{elocs}(e_{\text{old}}) = \text{elocs}(e_{\text{new}}) = \emptyset$                       |                                                                                                                                                                                                                                                                                                                                                    |                                                                                                            |                                                                                                                |
| $z := \text{CAS}(x, e_{\text{old}}, e_{\text{new}}) \xrightarrow{\text{CAS}(x, [[e_{\text{old}}]], [[e_{\text{new}}]])} z := [[e_{\text{old}}]]$ |                                                                                                 |                                                                                                                                                                                                                                                                                                                                                    |                                                                                                            | mfence $\stackrel{F}{\longrightarrow}$ skip $x := \overline{y} \xrightarrow{\text{Get}(x, \overline{y})}$ skip |
| $\overline{u} := x \xrightarrow{\text{Put}(\overline{y},x)} \text{skip}$                                                                         |                                                                                                 | rfence $\overline{n} \xrightarrow{rF(\overline{n})}$ skip                                                                                                                                                                                                                                                                                          | $pol1(\overline{n}) \xrightarrow{P(\overline{n})}$ skip                                                    |                                                                                                                |
|                                                                                                                                                  |                                                                                                 |                                                                                                                                                                                                                                                                                                                                                    | $v \neq v'$ $P(t) \xrightarrow{l} C$                                                                       |                                                                                                                |
| $\text{assume}(x = v) \xrightarrow{\text{IR}(x, v)} \text{skip}$                                                                                 |                                                                                                 | $\overline{p \xrightarrow{t:l} P[f \mapsto c]}$ $\frac{1R(x,v)}{w}$ skip                                                                                                                                                                                                                                                                           |                                                                                                            |                                                                                                                |
| $x := e \leadsto \text{assume}(y = v); x := e[v/y]$                                                                                              |                                                                                                 | for $y \in \text{elocs}(e)$ , $v \in \text{Val}$                                                                                                                                                                                                                                                                                                   |                                                                                                            |                                                                                                                |
|                                                                                                                                                  |                                                                                                 | $z := \text{CAS}(x, e_{\text{old}}, e_{\text{new}}) \rightarrow \text{assume}(y = v); z := \text{CAS}(x, e_{\text{old}}[v/y], e_{\text{new}})$                                                                                                                                                                                                     |                                                                                                            | for $y \in \text{elocs}(e_{\text{old}}), v \in \text{Val}$                                                     |
|                                                                                                                                                  |                                                                                                 | $z := \text{CAS}(x, e_{\text{old}}, e_{\text{new}}) \rightarrow \text{assume}(y = v); z := \text{CAS}(x, e_{\text{old}}, e_{\text{new}}[v/y])$                                                                                                                                                                                                     |                                                                                                            | for $y \in \text{elocs}(e_{\text{new}}), v \in \text{Val}$                                                     |

<span id="page-11-2"></span>Fig. 6. The RDMA<sup>TSO</sup> program and command transitions

 $(P_3)$  its wb<sub>R</sub> entry is eventually removed and applied to the remote memory; its associated ack<sub>p</sub> in out<sub>R</sub> (P<sub>4</sub>) is forwarded to rsp<sub>L</sub>; and (P<sub>5</sub>) later forwarded to wb<sub>L</sub> as cn.

That is, both get and put operations result in a completion notification (cn) when complete. Indeed, this is precisely why we record an acknowledgement for each put: the  $ack_{D}$  serves as a placeholder for a processed put, so that we can generate an associated notification when complete.

Recall that poll $(\overline{n})$  awaits the completion of the earliest unpolled get/put towards  $\overline{n}$ . To achieve this, completion notifications are left in  $\mathbf{wb}_L$  until polled. Executing poll $(\overline{n})$  can proceed if the head of  $\bf{w}b_L$  of the  $\bar{n}$ -queue pair is a cn entry, in which case this cn entry (the earliest in FIFO order) is removed. If  $\overline{n}$ -wb<sub>L</sub> is empty or its head is a write entry, then the execution of  $pol1(\overline{n})$  is blocked.

### <span id="page-11-0"></span>3.2 RDMA<sup>TSO</sup> Concrete Operational Semantics

We describe the RDMA<sup>TSO</sup> concrete operational semantics by separating the transitions of its program and hardware subsystems. The former describe the steps in program execution, e.g. how a branching program is reduced. The latter describe how the memory, store buffers and queue pairs evolve throughout the execution, e.g. how remote puts reach the memory. The RDMA<sup>TSO</sup> operational semantics is then defined by combining the transitions of its program and hardware subsystems.

Program Transitions. Program transitions in the middle of Fig. [6](#page-11-1) are defined via the transitions of their constituent commands. Command transitions are of the form C  $\stackrel{l}{\to}$  C', where C, C'  $\in$  Comm are sequential commands and *l* is a *label*. A label is either  $\varepsilon$  for silent transitions, or in Lab (defined at the top of Fig. [6\)](#page-11-1) for executing a primitive command. Labels are defined as the union of  $\text{Lab}_n$ for all nodes n. A label in Lab<sub>n</sub> may be (1)  $\mathbb{1W}(x^n, v)$ , for a CPU write on location x with value v;

(2)  $\text{IR}(x^n, v)$ , for a CPU read on x reading  $v$ ; (3) CASS $(x^n, v_1, v_2)$ , for a successful CAS reading the expected value  $v_1$  from x and updating it to  $v_2$ ; (4) CASF( $x^n$ ,  $v$ ), for a failed CAS where the expected value does not match the value v of x in memory; (5) F, for a memory fence; (6)  $P(\overline{n})$ , for a poll on  $\overline{n}$ ; (7) Get $(x^n, y^{\overline{n}})$ , for  $x^n := y^{\overline{n}}$ ; (8) Put $(y^{\overline{n}}, x^n)$ , for  $y^{\overline{n}} := x^n$ ; or (9) rF( $\overline{n}$ ), for a remote fence on  $\overline{n}$ .

Command transitions for sequential composition, choice and loops (the top line) are standard. The next four transitions reduce assignments and CAS operations. As assignments and CAS involve expressions, their transitions rewrite expressions step-by-step, as defined at the bottom of Fig. [6](#page-11-1) (via  $\rightsquigarrow$  transitions). Each rewrite step of assignment rewrites  $x := e$  as assume( $y = v$ );  $x := e[v/y]$ , replacing a location  $y$  in  $e$  with an arbitrary value  $v$  and checking that  $v$  matches the value of y via assume( $y = v$ ). Note that value v is unconstrained at this point and is later constrained when connecting program transitions with memory ones. The two rewrite transitions for CAS are analogous. Observe that when  $C \rightsquigarrow C'$ , then C silently transitions to C' (with label  $\varepsilon$ ). Once  $e$ in  $x := e$  is closed (i.e. contains no locations), then it is reduced to skip with the corresponding CPU write label for writing the value of  $[[e]]$  to x; mutatis mutandis for CAS operations. The other transitions reduce memory fences, gets, puts, remote fences, polls and assumes to skip with matching labels. Finally, assume $(x = v)$  (resp. assume $(x \neq v')$ ) reduces to skip with a  $\text{IR}(x, v)$  label when the in-memory value  $v$  of  $x$  matches (resp. does not match) the assertion.

Program transitions are of the form P  $\stackrel{t:l}{\longrightarrow}$  P', where P, P' denote (concurrent) programs,  $t$  is the identifier of the executing thread, and  $l$  is the transition label. Program transitions simply lift the transitions of their constituent threads (the bottom right rule).

Hardware Transitions. The  $R_{\text{DM}}$ <sup>TSO</sup> hardware transitions, given in the middle of Fig. [7,](#page-13-0) are of the form M, B, QP  $\xrightarrow{t:l}$  M', B', QP', where M, M' are global *memories*, B, B' are *store-buffer maps* and QP, QP' are *queue-pair maps*, defined at the top of Fig. [7.](#page-13-0) We model a memory as a map from locations to values. A store-buffer map is a function mapping each thread  $t$  to a store buffer on its associated node  $n(t)$ . A store buffer b on node *n* is a sequence of CPU writes and RDMA puts, gets and fences, as described in [§3.1](#page-8-0) (see Fig. [5\)](#page-10-0). A queue-pair map is a function mapping each thread t (on node  $n(t)$ ) to another function associating each non- $n(t)$  node with a queue pair. That is, each thread  $t$  is associated with a set of queue pairs, one for each other node on the network. A queue pair qp is a tuple of six buffers,  $req<sub>L</sub>$ ,  $in<sub>R</sub>$ ,  $wb<sub>R</sub>$ ,  $out<sub>R</sub>$ ,  $rsp<sub>L</sub>$  and  $wb<sub>L</sub>$ , as described in [§3.1.](#page-8-0) Each queue-pair buffer in turn is a sequence of entries as prescribed in Fig. [5.](#page-10-0) We write 'qp.' to project components of qp and use a standard map update notation to modify these components.

The first five hardware transitions describe the execution of CPU operations, as described in [§3.1.](#page-8-0) Specifically, when a thread writes  $v$  to  $x$ , it records this write in its store buffer (first transition). Recall that when a thread  $t$  reads from  $x$ , it first consults its own store buffer, followed by the memory if no write to x is found in the store buffer. This lookup chain is captured by  $M \triangleleft B(t)$ (second transition), defined below the hardware transitions in Fig. [7.](#page-13-0) The execution of a CAS or mfence proceeds if the store buffer of the executing thread is empty (the next three transitions).

The next transition describes the debuffering of a CPU write and propagating it to the memory. Similarly, the transition after describes debuffering a remote operation  $r$  towards node  $\overline{n}$ , where it is removed from the store buffer and appended to the req<sub>L</sub> component of the  $\bar{n}$ -queue pair.

The next three transitions describe executing a remote get, put or fence operation, where a corresponding entry is added to the store buffer. The penultimate transition describes executing a poll on  $\overline{n}$ , which removes the earliest completion notification in wb<sub>L</sub> of the  $\overline{n}$ -queue pair. The last transition describes how the queue-pair entries travel through its six buffers, captured by the queue-pair transitions ( $\rightarrow_{\text{qp}}$ ) defined at the bottom of Fig. [7](#page-13-0) (ignoring highlights for now).

<span id="page-13-0"></span>![](_page_13_Picture_1814.jpeg)

| $B' = B[t \mapsto (x := v) \cdot B(t)]$ $(M \triangleleft B(t))(x) = v$<br>$B(t) = \varepsilon$ $M(x) = v_1$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| M, B, QP $\xrightarrow{t:\mathbb{1}\mathbb{W}(x,v)} M$ , B', QP $\xrightarrow{t:\mathbb{R}(x,v)} M$ , B, QP<br>M, B, QP $\xrightarrow{t:\text{CASS}(x,v_1,v_2)} M[x \mapsto v_2]$ , B, QP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| $B(t) = \varepsilon$ $M(x) = v$<br>$M, B, QP$ $L:CASF(x,v)$<br>$M, B, QP$ $M, B, QP$ $L:FA = v$<br>$M, B, QP$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| $B(t) = b \cdot rc^n$ $rc^n \in \left\{ x := y^{\overline{n}}, y^{\overline{n}} := x, \text{ r fence } \overline{n} \right\}$ $QP(t)(\overline{n}) = qp$ $qp' = qp[req_L \mapsto rc^n \cdot qp.read_L]$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| M, B, QP $\xrightarrow{t:\varepsilon}$ M, B[t $\mapsto$ b], QP[t $\mapsto$ QP(t)[ $\overline{n} \mapsto$ qp']]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| $\frac{B' = B[t \mapsto (x := \overline{y}) \cdot B(t)]}{M, B, QP}$ $\qquad \frac{B' = B[t \mapsto (\overline{y} := x) \cdot B(t)]}{M, B, QP}$ $\qquad \frac{B' = B[t \mapsto (r \text{ fence } \overline{n}) \cdot B(t)]}{M, B, QP}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| $QP(t)(\overline{n}) = qp$ qp.wb <sub>L</sub> = $\alpha \cdot cn$ qp' = qp[wb <sub>L</sub> $\mapsto \alpha$ ] M, QP(t)( $\overline{n}$ ) $\rightarrow$ qp M', qp                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| M, B, QP $\xrightarrow{t:\mathbb{P}(\overline{n})}$ M, B, QP $[t \mapsto \mathbb{Q}P(t)[\overline{n} \mapsto \mathbb{q}p']$ $M, B, \mathbb{Q}P \xrightarrow{t:\varepsilon} M', B, \mathbb{Q}P[t \mapsto \mathbb{Q}P(t)[\overline{n} \mapsto \mathbb{q}p]]$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| $(M \triangleleft \alpha)(x) \triangleq \begin{cases} v & \text{if } \alpha = \beta \cdot (x := v) \cdot - \wedge \forall v'. x := v' \notin \beta \\ M(x) & \text{if } \forall v. x := v \notin \alpha \end{cases}$<br>$% \left\vert \mathbf{U}\right\vert$ with                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| $M' = M$<br>$qp' = qp[req_L \mapsto \alpha][in_R \mapsto (x:=\overline{y}) \cdot qp.in_R]$<br>qp.req <sub>L</sub> = $\alpha \cdot (x:=\overline{y})$<br>$M' = M$<br>$\vee$ qp.in <sub>R</sub> = $\alpha \cdot (x:=\overline{y})$<br>$qp' = qp[$ <b>in</b> <sub>R</sub> $\mapsto \alpha$ ][ <b>out</b> <sub>R</sub> $\mapsto$ ( $x:=\overline{y}$ ) · qp. <b>out</b> <sub>R</sub> ]<br>$M' = M$<br>qp.wb <sub>R</sub> = $\varepsilon$ qp' = qp[out <sub>R</sub> $\mapsto \alpha \cdot (x:=M(\overline{y})) \cdot \beta$ ]<br>$\vee$ qp.out <sub>R</sub> = $\alpha \cdot (x:=\overline{y}) \cdot \beta$<br>$M' = M$<br>$qp' = qp[out_R \mapsto \alpha][rsp_L \mapsto (x:=v) \cdot qp.rsp_L]$<br>$\vee$ qp.out <sub>R</sub> = $\alpha \cdot (x:=v)$<br>$M' = M$<br>$qp' = qp[rsp_L \rightarrow \alpha][wb_L \rightarrow cn \cdot (x:=v) \cdot qp.wb_L]$<br>$\vee$ qp.rsp <sub>L</sub> = $\alpha \cdot (x:=v)$<br>$M' = M[x \mapsto v]$<br>$\vee$ qp.wb <sub>L</sub> = $\alpha \cdot (x:=v) \cdot cn^*$<br>$qp' = qp[wb_L \mapsto \alpha \cdot cn^*]$<br>$M$ , qp $\rightarrow$ <sub>qp</sub> $M'$ , qp' |
| $qp.wb_L = cn^*$ $qp' = qp[req_L \rightarrow \alpha][in_R \rightarrow (\overline{y} := M(x)) \cdot qp.in_R]$ $M' = M$<br>qp.req <sub>L</sub> = $\alpha \cdot (\overline{y}:=x)$<br>$qp' = qp[in_R \mapsto \alpha][wb_R \mapsto (\overline{y}:=v) \cdot qp.wb_R][out_R \mapsto \text{ack}_p \cdot qp.out_R]$ $M' = M$<br>$\vee$ qp.in <sub>R</sub> = $\alpha \cdot (\overline{y}:=v)$<br>$M' = M[\overline{y} \mapsto v]$<br>$\vee$ qp.wb <sub>R</sub> = $\alpha \cdot (\overline{y}:=v)$<br>$qp' = qp[wb_R \mapsto \alpha]$<br>$M' = M$<br>$qp' = qp[out_R \mapsto \alpha][rsp_L \mapsto ack_p \cdot qp.rsp_L]$<br>$\vee$ qp.out <sub>R</sub> = $\alpha \cdot$ ack <sub>p</sub><br>$M' = M$<br>$qp' = qp[rsp_L \mapsto \alpha][wb_L \mapsto cn \cdot qp.wb_L]$<br>$\vee$ qp. <b>rsp</b> <sub>L</sub> = $\alpha \cdot$ ack <sub>p</sub><br>$M$ , qp $\rightarrow$ <sub>qp</sub> $M'$ , qp'                                                                                                                                                                                                            |
| $qp' = qp[req_L \mapsto \alpha]$<br>qp.req <sub>L</sub> = $\alpha \cdot$ (rfence $\overline{n}$ )<br>$qp.in_R = qp.out_R = qp.rsp_L = \varepsilon$<br>$M$ , qp $\rightarrow$ <sub>qp</sub> $M$ , qp'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |

Fig. 7. RDMA<sup>TSO</sup> hardware domains (above), hardware transitions (middle) and queue-pair transitions (below)

Recall from [§3.1](#page-8-0) that once a get operation enters a queue pair it travels through the queue-pair pipeline in six steps, i.e.  $G_1-G_6$  $G_1-G_6$  on p. [11.](#page-10-2) This is captured by the top  $\rightarrow_{qp}$  transition at the bottom of Fig. [7,](#page-13-0) where for brevity we have combined these six steps into one transition with a disjunctive premise, with each disjunct corresponding to a step in  $G_1-G_6$  $G_1-G_6$ . That is, if any of the six disjuncts in

<span id="page-14-0"></span>
$$
\frac{P\xrightarrow{t:\varepsilon}P'}{P,M,B,QP\Rightarrow P',M,B,QP} \qquad \frac{M,B,QP\xrightarrow{t:\varepsilon}M',B',QP'}{P,M,B,QP\Rightarrow P,M',B',QP'} \qquad \frac{P\xrightarrow{t:\varepsilon}P'}{P,M,B,QP\Rightarrow P',M',B',QP'} \qquad \frac{P\xrightarrow{t:\varepsilon}P'}{P,M,B,QP\Rightarrow P',M',B',QP'}
$$

Fig. 8. RDMA<sup>TSO</sup> operational semantics with the program and hardware transitions given in Fig. [6](#page-11-1) and Fig. [7](#page-13-0)

the premise of the rule hold, then the transition is enabled. Analogously, the second  $\rightarrow_{\text{qq}}$  transition describes how a put operation proceeds through the queue-pair pipeline, with each of the five disjuncts describing one of the five steps in  $P_1-P_5$  $P_1-P_5$  (p. [11\)](#page-10-4). Finally, the last  $\rightarrow_{\alpha}$  transition describes the execution of a remote fence as described in [§3.1,](#page-8-0) ensuring that it can only proceed if there are no pending operations on the queue pair (i.e. qp.in<sub>R</sub> = qp.out<sub>R</sub> = qp.rsp<sub>L</sub> =  $\varepsilon$ ).

**RDMA<sup>TSO</sup> Combined Transitions**. The RDMA<sup>TSO</sup> operational semantics is defined by combining its program and hardware transitions as shown in Fig. [8.](#page-14-0) When the program subsystem takes a silent step, then the hardware subsystem is unchanged (first transition); analogously, when the hardware subsystem takes a silent step, then the program subsystem is unchanged (second transition). Finally, when the program and hardware subsystems both take the same transition (with the same label and by the same thread), then the transition effect is that of their combined effects.

**Operational Semantics without PCIe.** Recall from  $\S$ 2 (p. [8\)](#page-6-2) that we model the PCIe-specific guarantee where a NIC remote read propagates all pending NIC remote writes (in  $w b_R$ ) to memory, while a NIC local read propagates all pending NIC local writes (in  $\mathbf{wb}_L$ ) to memory. Nevertheless, we can relax this as follows. For NIC remote reads, we can replace the highlighted premises of the get queue-pair transition (top  $\rightarrow_{\text{qp}}$  transition) in Fig. [7](#page-13-0) with  $qp' = qp[out_R \mapsto \alpha \cdot (x:=(M \triangleleft wh_R)(\overline{q})) \cdot \beta].$ That is, we no longer require  $\bf{wb}_R$  to be empty (i.e. there may be pending writes in  $\bf{wb}_R$ ), and when reading the value of  $\overline{y}$  we first check for pending writes on  $\overline{y}$  in wb<sub>R</sub>.

For NIC local reads, we can similarly replace the highlighted premises of the put queue-pair transition (middle  $\rightarrow_{\text{qp}}$  transition) with qp' = qp[req<sub>L</sub>  $\mapsto \alpha$ ] [in<sub>R</sub>  $\mapsto (\overline{q} := (M \triangleleft \text{wb}_L)(x)) \cdot \text{qp.in}_R$ ]. That is, we no longer require  $\bf{wb}_L$  to contain only completion notifications (and allow it also to contain pending writes), and we first check  $\mathbf{wb}_L$  for pending writes when reading the value of x.

**Observations**. Given a thread  $t$  and its store buffer b, all remote operations by  $t$  also go through b. Hence, as store buffers are FIFO, a CPU write cc in b before a remote operation rc in b always reaches the memory before rc is debuffered, and thus cc is visible to rc. Moreover, as all six buffers of the queue-pair pipeline are FIFO, remote operations maintain the order in which they were issued as they go through the queue-pair pipeline. Therefore, a thread always receives the completion notifications of get/put operations in the order they were submitted (i.e. the program order).

Observe that an rFence stipulates that only in<sub>R</sub>, out<sub>R</sub> and rsp<sub>L</sub> be empty but not wb<sub>L</sub> and wb<sub>R</sub>. As such, rFence cannot guarantee that the result of earlier put (resp. get) operations have reached the remote (resp. local) memory: they can still be pending in  $\mathbf{wb}_{R}$  (resp.  $\mathbf{wb}_{L}$ ). Moreover, rFence has no bearing on CPU operations and does not block their execution. Hence, later CPU operations (after rFence) may be visible to earlier get/put operations (those before rFence).

Recall that a put operation  $x^n := y$  comprises a local read from y and a remote write to x, and a get operation  $x := y^n$  comprises a remote read from y and a local write to x. Note that the local read of a put is fulfilled when it reaches the head of  $\text{in}_{\text{R}}$  and is subsequently forwarded to out<sub>R</sub>. This ensures that puts are executed in program order and that puts are executed before all later gets (as in Fig. [2g\)](#page-4-0). By contrast, the remote read of a get is fulfilled while in out<sub>R</sub> non-deterministically (i.e. not necessarily when it is at the head of  $\text{out}_R$ ). This means that remote reads of gets can be reordered with respect to one another, as well as with respect to the remote writes of puts (as in Fig. [2h\)](#page-4-0). Such reorderings can be prevented by adding an rFence after a get (as in Fig. [2i\)](#page-4-0).

Lastly, note that a poll retrieves the earliest cn in  $\mathbf{wb}_{\mathsf{L}}$  (i.e. at its head). In the case of gets, the result of the get (its local write) is sent to  $\bf{wb}_L$  before its associated cn. As such, if the head of  $\bf{wb}_L$ is cn, then its result is guaranteed to have reached the memory of the local node when polling. By contrast, in the case of puts, its remote write operation could still be in  $w_{R}$  when polling, and thus polling a put cannot guarantee that the effect of the put has reached the remote memory.

# <span id="page-15-1"></span>3.3 RDMA<sup>TSO</sup> Simplified Operational Semantics

The concrete operational semantics in [§3.2](#page-11-0) reflects the structure of the underlying hardware, namely that of the six buffers in a queue pair. However, since all four buffers ( $req<sub>L</sub>$ ,  $inq<sub>R</sub>$ ,  $sqrt<sub>R</sub>$ ,  $rsp<sub>L</sub>$ ) in the middle are FIFO, we can simplify them by modelling them as a single buffer. Specifically, we model a simple queue pair,  $\sup_{n \in \mathbb{R}} \mathcal{L}(\text{Span}_{n}^{\overline{n}}) \triangleq \text{Pipe}_{n}^{\overline{n}} \times \text{WBR}_{n}^{\overline{n}} \times \text{WBL}_{n}^{\overline{n}}$ , as a tuple comprising a *pipe*, as well as local and remote write-back buffers (as before). A pipe, pipe ∈ Pipe $\frac{\overline{n}}{n}$  ≜  $\{y^{\overline{n}}:=x^n, y^{\overline{n}}:=v, \text{ack}_p, x^n:=y^{\overline{n}}, x^n:=v, \text{r} \text{ fence } \overline{n}\}^*$ , is simply a sequence of puts, gets, simplified puts and gets (with values), acknowledgements, and remote fences.

The program and command transitions of our simplified semantics is identical to those of the concrete one (Fig. [6\)](#page-11-1). Moreover, the hardware transitions of our simplified semantics are those of the concrete semantics (Fig. [7\)](#page-13-0), except that the queue-pair transitions at the bottom of Fig. [7](#page-13-0) are replaced with the simplified queue-pair transitions available in the extended version. As before, the simplified operational semantics is obtained by combining its program and hardware transitions (Fig. [8\)](#page-14-0).

Finally, we show that our concrete operational semantics is equivalent to the simplified one. We have mechanised our proof in Coq, available in the supplementary material [\[Ambal et al.](#page-24-9) [2024b\]](#page-24-9), with an overview of the proof available in the extended version.

<span id="page-15-2"></span>THEOREM 3.1. The concrete RDMA<sup>TSO</sup> operational semantics is equivalent to the simplified one.

# <span id="page-15-0"></span>4 RDMA<sup>TSO</sup> Declarative Semantics

Events and Executions. In the literature of declarative models, the traces of a program are commonly represented as a set of executions, where an execution is a graph comprising: i) a set of events (graph nodes); and ii) a number of relations on events (graph edges). Each event is associated with the execution of a primitive command (in PComm) and is a tuple  $(i, t, l)$ , where  $i$  is the (unique) event identifier,  $t \in$  Tid identifies the executing thread, and  $l \in$  ELab is the event label, defined below.

Definition 4.1 (Labels and events). An event, e  $\in$  Event, is a triple  $(i, t, l)$ , where  $i \in \mathbb{N}$ ,  $t \in \text{Tid}$ and  $l \in \mathsf{ELab}_{n(t)}$ . The set of event labels is  $\mathsf{ELab} \triangleq \bigcup_n \mathsf{ELab}_n$  for all nodes n. An event label of n,  $l \in$  ELab<sub>n</sub>, is a tuple of one of the following forms:

- (CPU) local read:  $l = \ln(x^n, v_r)$
- (CPU) local write:  $l = 1W(x^n, v_w)$
- (CPU) CAS:  $l = \text{CAS}(x^n, v_r, v_w)$
- (CPU) memory fence:  $l = F$
- (CPU) poll:  $l = P(\overline{n})$
- NIC local read:  $l = n \ln(x^n, v_r, \overline{n})$
- NIC remote write:  $l = nrW(y^{\overline{n}}, v_w)$
- NIC remote read:  $l = nrR(\overline{y^n}, v_r)$
- NIC local write:  $l = \text{n} 1 \text{W}(x^n, v_\text{w}, \overline{n})$
- NIC fence:  $l = nF(\overline{n})$

Each event label denotes whether the associated primitive command is handled by the NIC (right column, prefixed with n), or the CPU (left column). A poll instruction is handled by the CPU (it simply awaits for a completion notification from the NIC). A put operation  $x^n := y$ , which consists of a NIC local read from  $y$  and a NIC remote write to  $x$ , is modelled as two events of type nlR (on  $y$ ) and nrW (on x). Similarly, a get  $x := y^n$  is modelled as two events of type nrR (on y) and nlW (on x).

We write type(l),  $\text{loc}(l)$ ,  $v_r(l)$ ,  $v_w(l)$ ,  $n(l)$  and  $\overline{n}(l)$  for the type (e.g. 1R), location, read value, write value, node and remote node of l, where applicable; e.g.  $loc(nR(x^n, v_r, \overline{n})) = x^n$ ,  $n(nR(x^n, v_r, \overline{n})) = n$ and  $\overline{n}(n\ln(x^n, v_r, \overline{n})) = \overline{n}$ . We lift these functions to events as expected. We write  $\iota(e)$ ,  $\iota(e)$ ,  $\iota(e)$  to project the corresponding components of an event  $e = (t, t, l)$ .

Issue and Observation Points. In what follows we distinguish between when an instruction is issued and when it is observed. Intuitively, an instruction is issued when it is processed by the CPU or the NIC, and it is observed when its effect is propagated to the local or remote memory. As such, since writes (be they by the CPU or NIC) are the only instructions that have an observable effect on memory, the time points at which they are issued and observed may differ. A CPU write is issued when it is added to the store buffer and is observed when it is debuffered and propagated to memory. Similarly, the local (resp. remote) write of a get (resp. put) is issued when it is added to  $wb<sub>L</sub>$  (resp.  $wb<sub>R</sub>$ ), and observed when it is propagated to memory. By contrast, all other events are instantaneous in that either they do not have an observable effect on memory and thus their issue and observation points coincide, or their effect is written to memory immediately. In particular, CAS operations are instantaneous. Note that the observation point of any instruction either coincides with its issue point (instantaneous events) or it follows its issue point (write events).

Notation. Given a relation  $r$  and a set  $A$ , we write  $r^+$  for the transitive closure of  $r; r^{-1}$  for the inverse of r;  $r|_A$  for  $r \cap (A \times A)$ ; and [A] for the identity relation on A, i.e.  $\{(a, a) | a \in A\}$ . We write  $r_1; r_2$  for the relational composition of  $r_1$  and  $r_2$ :  $\{(a, b) | \exists c. (a, c) \in r_1 \wedge (c, b) \in r_2\}$ . When r is a strict partial order, we write  $r|_{\text{imm}}$  for the *immediate* edges in  $r$ , i.e.  $r \setminus (r;r)$ . Given a set of events  $E$ and a location x, we write  $E_x$  for  ${e \in E \mid loc(e) = x}$ . Given a set of events E and a label type X, we write E.X for {e ∈ E | type(e) = X}, and define its sets of reads as  $E.R \triangleq E.IR\cup E.CAS\cup E.nIR\cup E.nrR$ , writes as  $E.W \triangleq E.1$ W∪ $E.C$ AS∪ $E.n1$ W∪ $E.nr$ W, NIC writes as  $E.nW \triangleq E.n1$ W∪ $E.nr$ W and instantaneous events as E.Inst  $\triangleq E \setminus (E.1 \mathbb{W} \cup E. \mathbb{N} \cup E. \$ writes and NIC remote writes (labelled lW, nlW and nrW) are only visible when they respectively leave the store buffer,  $wb<sub>L</sub>$ , and  $wb<sub>R</sub>$ , and are thus excluded from the set of instantaneous events.

The 'same-location' relation is sloc  $\triangleq \{ (e, e') \in Event^2 \mid loc(e) = loc(e') \}$ ; the 'same-thread' relation is sthd  $\triangleq \{(e, e') \in Event^2 \mid t(e) = t(e')\}$ ; and the 'same-queue-pair' relation is sqp  $\triangleq \{(e, e') \in$ Event<sup>2</sup> |  $t(e) = t(e') \wedge \overline{n}(e) = \overline{n}(e')$ }. Note that sqp  $\subseteq$  sthd and that sloc, sthd and sqp are all symmetric. For a set of events E, we write E.sloc for sloc $|_E$ ; similarly for E.sthd and E.sqp.

<span id="page-16-0"></span>*Definition 4.2 (Pre-executions).* A *pre-execution* is a tuple  $G = \langle E, \text{po}, \text{rf}, \text{mo}, \text{pf}, \text{nf}_2 \rangle$ , where:

- $E \subseteq$  Event is the set of events and includes a set of *initialisation* events,  $E^0 \subseteq E$ , comprising a single write with label  $\mathsf{lw}(x, 0)$  for each  $x \in \text{Loc.}$
- po  $\subseteq E \times E$  is the '*program order*' relation defined as a disjoint union of strict total orders, each ordering the events of one thread, with  $E^0 \times (E \setminus E^0) \subseteq$  po.
- rf  $\subseteq E.W \times E.R$  is the 'reads-from' relation on events of the same location with matching values; i.e.  $(a, b) \in \text{rf} \Rightarrow (a, b) \in \text{sloc} \land v_w(a) = v_t(b)$ . Moreover, rf is total and functional on its range: every read in  $E.R$  is related to exactly one write in  $E.W$ .
- mo  $\triangleq \bigcup_{x \in \text{Loc}}$  mo<sub>x</sub> is the '*modification-order*', where each mo<sub>x</sub> is a strict total order on E. $W_x$ with  $E_x^0 \times (E \cdot W_x \setminus E_x^0) \subseteq \text{mo}_x$  describing the order in which writes on x reach the memory.
- pf  $\subseteq$  E.nW  $\times$  E.P is the 'polls-from' relation, relating earlier (in program-order) NIC writes to later poll operations on the *same queue pair*; i.e. pf ⊆ po ∩ sqp. Moreover, pf is functional on its domain (every NIC write can be be polled at most once), and pf is total and functional on its range (every poll in  $E.P$  polls from exactly one NIC write).
- nfo  $\subseteq$  *E*.sqp is the '*NIC flush order'*, such that for all  $(a, b) \in E$ .sqp, if  $a \in E$ .nlR,  $b \in E$ .nlW, then  $(a, b)$  ∈ nfo∪nfo<sup>-1</sup>, and if  $a \in E$ .nrR,  $b \in E$ .nrW, then  $(a, b) \in$  nfo∪nfo<sup>-1</sup>.

Recall from [§2](#page-3-0) that we model the PCIe-specific guarantee where a NIC local read (nlR) propagates all pending NIC local writes (nlW) in  $\bf{wb}_L$  (on the same queue pair) to memory, while a NIC remote read (nrR) propagates all pending NIC remote writes (nrW) in  $wb_R$  (on the same queue pair) to memory. In other words, the issue point of an nlR event  $e_r$  is totally ordered with respect to the issue and observation points of any nlW event  $e_w$  on the same queue pair. Specifically, either (1)  $e_w$ had already been flushed before issuing  ${\sf e}_r$  (i.e.  ${\sf e}_w$  had already left  ${\sf wb}_{{\sf L}})$  and thus  ${\sf e}_w$  is issued and observed before  ${\sf e}_r$ ; or (2)  ${\sf e}_w$  was in  ${\sf wb}_L$  before issuing  ${\sf e}_r$ , in which case we cannot issue  ${\sf e}_r$  until  $\bf{wb}_L$  is emptied, propagating  $\bf{e}_w$  to memory, and thus  $\bf{e}_w$  is issued and observed before  $\bf{e}_r$ ; or (3)  $\bf{e}_w$ has not reached  $\bf{wb}_L$  when we issue  $\bf{e}_r$ , in which case  $\bf{e}_w$  will reach and leave  $\bf{wb}_L$  (i.e. is issued and observed) after e<sub>r</sub>. Similarly for nrR/nrW events. We model this total order through the nfo relation, stipulating that all NIC local reads and writes (resp. all NIC remote reads and writes) on the same queue pair be totally ordered. Later we describe how we can relax this PCIe guarantee (see p. [21\)](#page-20-1).

**Derived Relations.** Given a pre-execution  $\langle E, \rho$ , rf, mo, pf, nfo), we define the 'reads-before' relation as rb ≜ (rf<sup>-1</sup>; mo) \ [E], relating each read *r* to writes that are mo-after the write *r* reads from. We further define the rf-buffer relation as rf<sub>b</sub>  $\triangleq$  [1W]; (rf ∩ sthd); [1R], including CPU rf edges by the same thread (with access to the same store buffer). We define the  $rf<sub>b</sub>$ -complement as  $r f_{\overline{b}} \triangleq r f \setminus r f_b$ , including all other rf edges (i.e. by different threads or involving remote operations). Intuitively, when  $w \stackrel{\text{rf}_b}{\longrightarrow} r$ , then  $r$  may read from  $w$  before it is observable. Specifically, as CPU writes are delayed in the store buffer and CPU reads first check the buffer,  $r$  can read from  $w$  either when (1)  $w$  is still in the thread's store buffer (i.e.  $w$  is not yet observable); or (2)  $w$  is in the memory (i.e.  $w$ is observable). By contrast, when  $w \stackrel{rf_b}{\longrightarrow} r$ , then  $r$  reads from  $w$  only once it is observable (i.e. it has reached the memory). Analogously, we define the rb-buffer relation as rb<sub>b</sub>  $\triangleq$  [1R]; (rb∩sthd); [1W].

Recall that we distinguish between the issue and observation points of an event. We thus define the 'issue-preserved program order', ippo, as the subset of po edges (ippo  $\subseteq$  po) that must be preserved when issuing instructions. That is, if two events are ippo-related, then they must be issued in program order; otherwise they may be reordered and thus issued in either order. The table at the top of Fig. [9](#page-18-0) describes which po edges are included in ippo, where  $\checkmark$  denotes that the two instructions are ippo-related (i.e. their issue order is preserved and they must be issued in program order),  $\chi$  denotes that they are not ippo-related (i.e. their issue order is not preserved and they may be issued out of order) and sqp denotes that they are ippo-related iff they are on the same queue pair. For instance, when a CPU instruction is followed by a NIC one, then they are issued in order (top-right quadrant of the table). By contrast, when a NIC instruction is followed by a CPU one, then they may be reordered (bottom-left quadrant), as if the NIC instruction was executed in a parallel thread (as discussed in  $\S$ 2), resulting in weak behaviours such as  $z=1$  in Fig. [2b.](#page-4-0)

Analogously, we define the 'observation-preserved program order', oppo, as the subset of po edges (oppo  $\subseteq$  po) that must be preserved when observing the effects of instructions. In other words, if two events are oppo-related, then they become observable in program order; otherwise they may be reordered and become observable in either order. The table at the bottom of Fig. [9](#page-18-0) describes which po edges are included in oppo, with  $\checkmark$ ,  $\checkmark$  and sqp interpreted in the same way as for ippo.

Observe that ippo and oppo only differ in four cells. In the case of B1 and B5, this is because CPU writes (type lW) are delayed in the store buffer and may be reordered and thus become observable after CPU operations that do not go through the store buffer, namely CPU reads and polls. Analogously, in the case of G10 (resp. I10), this is because NIC remote (resp. local) writes are delayed in  $w_{bR}$  (resp.  $w_{bL}$ ), so a later remote fence do not ensure the writes have propagated to memory.

<span id="page-18-0"></span>

|                          | ippo  |                | <b>CPU</b>      |                         |                         |                         | <b>NIC</b>   |     |                        |     |     |              |              |
|--------------------------|-------|----------------|-----------------|-------------------------|-------------------------|-------------------------|--------------|-----|------------------------|-----|-----|--------------|--------------|
|                          |       |                | $\mathbf{1}$    | $\overline{2}$          | 3                       | $\overline{4}$          | 5            | 6   | $\overline{7}$         | 8   | 9   | 10           |              |
|                          |       |                | 1R              | 1W                      | CAS                     | F                       | P            | n1R | nrW                    | nrR | nlW | nF           |              |
|                          | CPU   | $\overline{A}$ | 1R              | √                       | ✓                       | ✓                       | ✓            | J   | ✔                      | J   | ✓   | ✓            | ✓            |
|                          |       | B              | $\overline{1W}$ | ✓                       | J                       | J                       | ✔            | J   | ✓                      | ✓   | J   | ✓            | J            |
|                          |       | $\mathbf C$    | CAS             | J                       | ✔                       | ✔                       | ✔            | J   | J                      | ✔   | ✓   | J            | J            |
|                          |       | D              | F               | ✓                       | ✓                       | $\checkmark$            | $\checkmark$ | √   |                        |     | J   | $\checkmark$ | $\checkmark$ |
| Earlier in Program Order |       | E              | $\overline{P}$  | ✓                       | J                       | J                       | ✓            | J   | J                      | J   | J   | J            | J            |
|                          |       | F              | nIR             | Х                       | Х                       | Х                       | Х            | Х   | sqp                    | sqp | sqp | sqp          | sqp          |
|                          | $\Xi$ | G              | nrW             | $\overline{\textsf{x}}$ | Х                       | Х                       | Х            | Х   | Х                      | sqp | sqp | sqp          | sqp          |
|                          |       | H              | nrR             | X                       | X                       | X                       | X            | X   | X                      | X   | X   | sqp          | sqp          |
|                          |       | П              | nlW             | Х                       | Х                       | $\overline{\mathsf{x}}$ | Х            | Х   | Х                      | Х   | Х   | sqp          | sqp          |
|                          |       | J              | nF              | X                       | $\overline{\textsf{x}}$ | $\overline{\textbf{x}}$ | Х            | Х   | sqp                    | sqp | sqp | sqp          | sqp          |
|                          |       |                |                 |                         |                         |                         |              |     |                        |     |     |              |              |
|                          |       |                |                 |                         |                         |                         |              |     | Later in Program Order |     |     |              |              |
|                          |       |                |                 | <b>CPU</b>              |                         |                         |              |     | <b>NIC</b>             |     |     |              |              |
|                          | oppo  |                | $\mathbf{1}$    | $\overline{2}$          | 3                       | $\overline{4}$          | 5            | 6   | $\overline{7}$         | 8   | 9   | 10           |              |
|                          |       |                |                 | 1R                      | 1W                      | CAS                     | F            | P   | nlR                    | nrW | nrR | nlW          | nF           |
|                          | CPU   | A              | 1R              | J                       | J                       | ✓                       | Ÿ            | J   | J                      | J   | ✓   | ✓            | ✓            |
|                          |       | B              | 1W              | Х                       | ✓                       | ✓                       | ✓            | Х   | ✓                      | ✓   | ✓   | ✓            | ✓            |
|                          |       | $\mathbf C$    | CAS             | ✓                       | ✓                       | ✓                       | ✓            | √   | ✓                      | √   | ✓   | ✓            | ✓            |
|                          |       | $\overline{D}$ | F               | J                       | J                       | $\bar{J}$               | $\checkmark$ | J   | ✓                      | J   | V   | J            | J            |
|                          |       | E              | P               | J                       | J                       | ✓                       | ✔            | J   | J                      | ✓   | J   | J            | J            |
|                          | $\Xi$ | F              | nIR             | Х                       | Х                       | Х                       | X            | Х   | sqp                    | sqp | sqp | sqp          | sqp          |
| Earlier in Program Order |       | G              | nrW             | Х                       | Х                       | Х                       | X            | Х   | Х                      | sqp | sqp | sqp          | Х            |
|                          |       | H              | nrR             | $\overline{\textsf{x}}$ | Х                       | X                       | Х            | Х   | Х                      | Х   | X   | sqp          | sqp          |
|                          |       | П              | nlw             | Х                       | Х                       | Х                       | Х            | X   | X                      | X   | Х   | sqp          | X            |
|                          |       | J              | nF              | X                       | Х                       | Х                       | Х            | X   | sqp                    | sqp | sqp | sqp          | sqp          |

Later in Program Order

Fig. 9. The RDMA<sup>TSO</sup> ordering constraints on ippo (above) and oppo (below), where ✓ denotes that instructions are ordered (and cannot be reordered),  $\chi$  denotes they are not ordered (and may be reordered), and sqp denotes they are ordered iff they are on the same queue pair. Replacing sqp in the highlighted cells (G8-G9) with  $X$  would yield oppo for the weaker model without the PCIe guarantee of reads flushing buffers (p. [21\)](#page-20-1).

From Programs to Executions. The *semantics* of a program P is a set of *consistent* executions (defined shortly) of P, defined by induction on the structure of P. This definition is standard and omitted (see the extended version). The executions produced by this construction are well-formed, as we define below.

<span id="page-18-1"></span>*Definition 4.3 (Executions).* A pre-execution  $G = \langle E, \text{po}, \text{rf}, \text{mo}, \text{pf}, \text{no}\rangle$  is well-formed if the following hold for all  $w, r, w_1, w_2, p_2$ :

(1) Poll events poll-from the oldest non-polled remote operation on the same queue pair:

if  $w_1 \in G$ .nW and  $w_1 \xrightarrow{\text{po} \cap \text{sqp}} w_2 \xrightarrow{\text{pf}} p_2$ , then there exists  $p_1$  such that  $w_1 \xrightarrow{\text{pf}} p_1 \xrightarrow{\text{po}} p_2$ .

- (2) Each put (resp. get) operation corresponds to two events: a read and a write with the read immediately preceding the write in po: (a) if  $r \in G.n \in \mathbb{R}$  (resp.  $r \in G.n \in \mathbb{R}$ ), then  $(r, w) \in \text{pol}_{\text{imm}}$ for some  $w \in G$ .nrW ( $w \in G.$ nlW); and (b) if  $w \in G.$ nrW (resp.  $w \in G.$ nlW), then  $(r, w) \in \text{pol}_{\text{imm}}$ for some  $r \in G.n \, \text{IR}$  ( $r \in G.n \, \text{r} \, \text{R}$ ).
- (3) Read and write events of a put (resp. get) have matching values: if  $(r, w) \in G$ .po $|_{\text{imm}}$ , type $(r) \in \{n\}$ .R, nrR and type $(w) \in \{n\}$ .N, nrW  $\}$ , then  $v_r(r) = v_w(w)$ .

An execution is a pre-execution (Def. [4.2\)](#page-16-0) that is well-formed.

We use 'G' to project the components and (derived) relations of execution  $G$ ; e.g.  $G$ .  $f$  and G.ippo. When the choice of G is clear, we simply write e.g. rf and ippo. See the extended version for execution examples.

Consistency. Note that the notion of an execution (Def. [4.3\)](#page-18-1) imposes very few constraints on its po, rf, mo, pf and nfo relations. Such restrictions and thus the permitted behaviours of a program are determined by defining the set of consistent executions, defined below.

<span id="page-19-0"></span>Definition 4.4 (RDMA<sup>TSO</sup>-consistency). An execution  $\langle E, \rho$ o, rf, mo, pf, nfo) is RDMA<sup>TSO</sup>-consistent iff (1) ib is irreflexive; (2) ob is irreflexive; and (3)  $([Inst]; ib; ob)^+$  is irreflexive, where:

![](_page_19_Picture_928.jpeg)

The ib (resp. ob) relation is an extension of ippo (resp. oppo), describing the issue (resp. observation) order across the instructions of different threads and nodes. RDMA<sup>TSO</sup>-consistency requires that ib and ob be irreflexive (i.e. yield strict partial orders as they are defined transitively).

The rf (resp. pf) component in ib states that if e reads from (resp. polls from)  $w$ , then  $w$  must have been issued before e. Recall that nfo totally orders the nlR/nlW and nrR/nrW operations on the same queue pair and is thus in ib. The rb<sub>b</sub> component in ib ensures that a CPU read  $r$  by thread t on location x observes all earlier writes on x by t: if r in t reads from  $w_r$  ( $w_r \xrightarrow{rf} r$ ), which is later overwritten by w in  $t$  ( $w_r \stackrel{\text{mo}}{\longrightarrow} w$  and  $(w, r) \in \text{sthd}$ ), i.e.  $r \stackrel{\text{rb}_b}{\longrightarrow} w$ , then w must have been issued after r, as otherwise r should have read from the later  $w$  and not  $w_r$ . Note that this is not the case for rb  $\mathsf{b}$  rb<sub>b</sub>: *r* can be issued *after* a later write *w* and still not observe it because the effect of *w* has not yet reached the memory ( $w$  is in  $wb_L/wb_R$  or the store buffer of another thread).

The rf<sub>b</sub> component in ob states that if a read *r* reads from a write *w* that passed through a different buffer (i.e. either w is part of an RDMA operation and went through  $\text{wb}_L/\text{wb}_R$ , or w is a CPU write by another thread and thus went through a different store buffer), then  $r$  can only read from  $w$  once it has reached the memory, i.e. only once  $w$  is observable, and thus  $w$  must have become observable before r. The  $[n]$ ,  $pf$  component states that if  $p$  polls from a NIC local write  $w$ , then  $w$  must have left the  $w$ b<sub>L</sub> buffer and reached the memory. Note that this is not the case for nrW events: polling an nrW event  $w$  succeeds when  $w$  is in  $w\mathbf{b}_R$  and  $w$  may not have yet reached the memory. The nfo in ob can be justified as in the case of ib.

The rb component ensures that a read  $r$  on location  $x$  observes the latest write on  $x$  that has reached the memory: if  $w_r \stackrel{\text{rf}}{\longrightarrow} r$  and  $w_r \stackrel{\text{mo}}{\longrightarrow} w$ , i.e.  $r \stackrel{\text{rb}}{\longrightarrow} w$ , then w must have reached the memory after  $r$  was issued/observed, as otherwise  $r$  should have read from this later  $w$  and not  $w_r$ . As  ${\sf mo}$ describes the order in which the writes on each location reach the memory, it is included in ob.

The third condition Def. [4.4](#page-19-0) asks that  $([\texttt{Inst}]; \textit{ib}; \textit{ob})^+$  be irreflexive. Intuitively, if  $\mathsf{e}_1 \xrightarrow{\textit{ib}} \mathsf{e}_2 \xrightarrow{\textit{ob}} \mathsf{e}_3$ , then (1)  $e_1$  is issued before  $e_2$ ; (2)  $e_2$  is issued before it is observed; and (3)  $e_2$  is observed before  $e_3$ . If  $e_1$  and  $e_3$  are instantaneous events (i.e. their issue and observation points coincide) then  $e_1$ necessarily precedes  $e_3$ , and thus the semantics prohibits creating a cycle from these dependencies.

Finally, we show that our  $R_{\text{DM}}$ <sup>TSO</sup> declarative semantics is equivalent to the simplified operational semantics in [§3,](#page-7-0) with the full proof given in the extended version. Note that as a corollary (of theorem [3.1\)](#page-15-2), our declarative semantics is also equivalent to the concrete operational semantics in [§3.](#page-7-0)

THEOREM 4.5. The  $RDMA^{TSO}$  declarative semantics is equivalent to the simplified operational one.

Proc. ACM Program. Lang., Vol. 8, No. OOPSLA2, Article 341. Publication date: October 2024.

Recall that RDMA<sup>TSO</sup> assumes x86-TSO CPUs (subject to TSO consistency). As such, RDMA<sup>TSO</sup> can be seen as an extension of TSO [\[Alglave et al.](#page-24-4) [2014\]](#page-24-4). That is, as we show in the following theorem, for programs without remote operations, RDMA<sup>TSO</sup> and TSO coincide (see the extended version for the full proof).

#### THEOREM 4.6. For programs without remote operations, TSO- and RDMA<sup>TSO</sup>-consistency coincide.

<span id="page-20-1"></span>Declarative Semantics without PCIe. To relax the PCIe-specific constraint (stipulating that NIC local/remote reads propagates all pending NIC writes in  $\text{wb}_L/\text{wb}_R$  to memory), we adapt our declarative model as follows: (1) we no longer need the nfo relation in Def. [4.2](#page-16-0) and Def. [4.3](#page-18-1) (as nfo is used to capture the 'NIC flush order' under PCIe); (2) we replace sqp in cells G8 and G9 of the oppo table (Fig. [9\)](#page-18-0) with  $\chi$  (as a later nrR/nlW no longer flushes wb<sub>R</sub> and thus may not observe an earlier nrW); and (3) we redefine rf<sub>b</sub> and rb<sub>b</sub> as follows to include events on the same queue pair:

 $\mathsf{rf}_b \triangleq ([1\mathsf{W}]; (\mathsf{rf} \cap \mathsf{sthd}); [1\mathsf{R}]) \cup (\mathsf{rf} \cap \mathsf{sqp})$   $\qquad \qquad \mathsf{rb}_b \triangleq ([1\mathsf{R}]; (\mathsf{rb} \cap \mathsf{sthd}); [1\mathsf{W}]) \cup (\mathsf{rb} \cap \mathsf{sqp})$ 

Recall that  $w \stackrel{rf_b}{\longrightarrow} r$  (resp.  $r \stackrel{rb_b}{\longrightarrow} w$ ) denotes  $r$  reads from (resp. before)  $w$  before  $w$  is observable. In the strong model with PCIe guarantee, NIC local/remote reads flush the  $\mathbf{wb}_L/\mathbf{wb}_R$ , and thus r reads from (resp. before)  $w$  only once  $w$  is observable. By contrast, in the relaxed model without PCIe guarantee, NIC local/remote reads no longer flush the  $\mathbf{wb}_L/\mathbf{wb}_R$ , and thus r can read from (resp. before) w while it is still in  $wb_L/wb_R$  (i.e. not yet observable). We thus expand the rf<sub>b</sub> (resp. rb<sub>b</sub>) definition to account for the possibility of reading from (resp. before) a write before it is observable.

### <span id="page-20-0"></span>5 Validating the RDMA<sup>TSO</sup> Model

To complement our formal semantics, we conducted an extensive validation of litmus tests on two distinct setups: InfiniBand (IB) and RDMA over Converged Ethernet (RoCE). In the IB setup, we used Dell PowerEdge R740 x86-64 machines, Intel(R) Xeon(R) Gold 6132 CPUs (2.60GHz, 14 cores), Ubuntu 22.04.2 LTS with Mellanox Technologies MT28908 Family [ConnectX-6] controller. In the RoCE setup, we used machines with Intel(R) Xeon(R) E-2286G CPUs (4.00GHz, 14 cores), ran Linux kernel version 4.18.0, Red Hat Enterprise Linux version 8 (477.27.1), with a Mellanox Technologies MT27800 Family [ConnectX-5] controller. Each of our tests is written in C – see the extended version and the code in the supplementary material [\[Ambal et al.](#page-24-9) [2024b\]](#page-24-9).

Litmus Tests and Outcomes. We focused our validation efforts on a set of 37 litmus tests representative of a wide range of allowed and disallowed weak behaviours, including several variants of well-known concurrent tests in the literature such as 'store buffering' (SB), 'message passing' (MP), 'load buffering' (LB), 'independent reads of independent writes' (IRIW), parallel writes (2+2W) and so forth. Our results corroborate our  $RDMA<sup>TSO</sup>$  model and confirm that (1)  $RDMA<sup>TSO</sup>$ is not too strong (i.e. it does not prohibit behaviours exhibited by the hardware); and (2)  $R_{\text{IDMA}}^{\text{TSO}}$  is not too weak (i.e. it does not admit too many weak behaviours not exhibited by hardware). Indeed, despite extensive testing we detected no behaviours prohibited by  $R_{M}^{TSC}$ ; and we observed almost all behaviours allowed by RDMA<sup>TSO</sup>, with a few exceptions detailed below.

Bringing about Weak Behaviours. As with litmus testing for local (CPU-only) concurrency for established models such as TSO, observing a weak behaviour relies on several factors, including the order in which threads are scheduled and interleaved, as well as the timing of how writes are propagated and made visible to different threads (e.g. when writes are removed from the store buffer and propagated through the cache hierarchy and the memory). In the context of (RDMA) programs with remote operations, there are additional factors at play such as the NIC workload. As such, inducing weak behaviours necessitates creating suitable conditions, which may involve stressing the NIC and/or the CPU in the background. Indeed, as shown by the work of [Dan et al.](#page-25-8)  $[2016]$ , this is far from straightforward as they failed to observe *any* weak behaviours at all. To remedy this, we used several techniques to elicit the weak behaviours permitted by the specification. Specifically, since no testing tool currently exists for automatically overwhelming or decelerating an RDMA system, in close consultation with NVIDIA experts we devised several techniques for creating bottlenecks and fostering the emergence of weak behaviours, as outlined below:

- NIC delays: We introduced delays on the Network Interface Card (NIC) by initiating numerous flood RDMA read or write operations between the client and the server on the designated queue pair (QP) at a specific juncture within the litmus test. As these operations still obey the same ordering rules, they cause some of the other operations in the litmus test to be delayed until they are successfully completed.
- CPU delays: We strategically injected CPU delays of random length into the litmus tests at certain points. We did this by selecting a random duration from three possibilities: 0 nanoseconds, half of the round-trip time (RTT) and the full RTT, delaying the CPU execution by the chosen duration. RTT represents the round-trip time between the client and the server, capturing the duration it takes for a packet to travel from the sender to the receiver and back. We implemented the delay functionality using a busy-wait loop, where the CPU remains occupied, repeatedly polling the clock without executing additional tasks. Once the delay duration has elapsed, the loop concludes, allowing the CPU to resume normal execution. While such delays can bring about weak behaviours in certain tests, such as that in Fig. [2d,](#page-4-0) it may impede observing others such as that in Fig. [2b,](#page-4-0) which requires no delay between  $\overline{z}^2 \coloneqq x$ and  $x := 1$ . Therefore, by incorporating random delays, we aimed to encompass the full spectrum of potential scenarios, ensuring comprehensive coverage of behavioural variations across different tests. Interestingly, we observed that even minimal delays introduced by printing variables for debugging purposes can disrupt the manifestation of behaviours in certain tests (e.g. that in Fig. [2b\)](#page-4-0), hence we limited our use of print statements.
- High RDMA traffic loads: While introducing NIC delays as described above can sometimes help induce certain interleavings, e.g. by delaying one node while letting another to continue, in some cases delays do not help as both relevant operations are delayed by the same NIC delay operation. In such cases, using background traffic on unrelated queue pairs allows additional interleavings. Specifically, we generated concurrent high RDMA traffic loads on the background using numerous queue pairs and utilising the zero-copy mode (transferring data directly between the memory of the sender and receiver without intermediate copying). This increased RDMA activity introduces competition for shared system resources such as CPU cycles, memory bandwidth, network capacity and NIC processing capacity. This led to contention amongst threads in a litmus test, potentially altering their scheduling behaviour. This strategic approach proved pivotal in uncovering weak behaviours in several tests such as MP3bis in the extended version, which might otherwise have gone unnoticed. We observed that the number of queue pairs utilised to generate the traffic load plays a crucial role in certain tests. For instance, in the case of GFP2 (see the extended version), we used 128 queue pairs to induce the weak behaviour.
- Synchronisation: As is common practice in validation via litmus testing, we employed loops to bring about the desired weak behaviours. For instance, in order to ensure that a local (resp. remote) load operation reads the desired value required by a given weak behaviour, we place the operation within a loop (in effect replacing the single operation with multiple such operations), iterating until the desired value is read. In other cases, we placed the entire litmus test within a loop. Using loops this way can sometimes replace the CPU delay mechanism discussed above, allowing it to pinpoint the delay needed to reproduce the behaviour.

Validation Results. Using the techniques above, we successfully confirmed almost all weak behaviours allowed by  $R_{\rm DMA}^{\rm TSO}$ , including all  $\checkmark$  examples in Figs. [1](#page-3-1) to [3,](#page-6-0) as well as several variants of MP, SB, LB, IRIW, 2+2W, and others, observed at varying frequencies (see the extended version for more details). The frequency of observing these behaviours is influenced not only by the nuances of the litmus test itself, but also by the infrastructure settings such as the network setup, switch configuration and NIC capabilities. Moreover, dynamic and unpredictable factors such as network congestion, packet drops, latency fluctuations, bandwidth utilisation and routing changes also significantly impact the observations. An interesting phenomenon we encountered was the variability in behaviour manifestation even within a specific test scenario. For example, in the case of GFP2, the weak behaviour was observed at one of our test premises but not the other (we had two test premises at two distinct geographical locations). This observation highlights the intricacy and subtlety involved in inducing weak behaviours.

In 4 of the litmus tests, we did not observe the allowed weak behaviour because the hardware implementation did not utilise the weakness permitted by the specification. As discussed in [§2,](#page-3-0) manifesting the weak behaviour in such cases relies on polling a put operation before the remote NIC write completes, e.g. as in Fig. [3d.](#page-6-0) This weak behaviour is allowed by  $R_{\text{DMA}}^{\text{TSD}}$  because the specification allows the remote NIC to return an acknowledgement before performing the associated write. Nevertheless, according to NVIDIA experts, this weakness is not utilised by the hardware implementations. To the best of our knowledge, these weak behaviours cannot be observed on current hardware implementations, but they might emerge in future ones.

Limitations. The main limitation of our validation is that weak behaviours are only exposed by hand-crafted techniques (discussed above) that stress the system in certain ways. This is currently a difficult trial-and-error process that requires a high degree of knowledge of current hardware implementations (which we acquired through close consultation with NVIDIA experts). As such, executing RDMA tests is currently not amenable to mass automation as in the frameworks of [\[Alglave et al.](#page-24-3) [2021,](#page-24-3) [2014;](#page-24-4) [Raad et al.](#page-26-6) [2022\]](#page-26-6). To adapt these frameworks for RDMA, one would need to develop systematic heuristics for automatically applying the techniques discussed above. We leave this challenge to future work.

#### <span id="page-22-0"></span>6 Related and Future Work

RDMA Semantics. To our knowledge, the only existing work on the formal semantics of RDMA programs is that of coreRMA [\[Dan et al.](#page-25-8) [2016\]](#page-25-8), which has several key limitations, as follows. (1) Although [Dan et al.](#page-25-8) [\[2016\]](#page-25-8) attempted to validate coreRMA, they observed none of the weak behaviours allowed by coreRMA in existing hardware implementations. This is in contrast to our work, where we have observed almost all weak behaviours allowed by RDMA<sup>TSO</sup> in existing implementations, and in the rare cases where we could not observe a behaviour, we have confirmed that this is because existing implementations explicitly did not utilise the weakness allowed by the specification. (2) coreRMA assumes that CPU concurrency is governed by the strong and unrealistic SC model [\[Lamport](#page-25-11) [1979\]](#page-25-11). (3) coreRMA only presents a declarative (and not operational) characterisation. (4) Most importantly, coreRMA departs from the RDMA specification [\[IBTA](#page-25-2) [2022\]](#page-25-2) in three important ways, as follows.

First, they do not model the poll operation,  $pol1(n)$ , described in the specification. Instead, they model a flush operation,  $\text{Flush}(n)$ , whose behaviour does not match any operation defined in the specification. Specifically, flush(n) waits for all previous RDMA operations on the n-queue pair to complete and further blocks later CPU operations and RDMA operations on the same queue pair. In other words,  $f$ lush $(n)$  is tantamount to rfence $(n)$  followed by mfence. However,  $f$ lush $(n)$  does not block later RDMA operations on different queue pairs, and hence under coreRMA there is no clean way to enforce an order on two RDMA operations on different queue pairs.

For instance, to ensure that  $x := y^2$ ;  $z^3 := x$  is executed in order (so that  $z$  gets the updated value of x), it is not enough to add a flush and write  $x := y^2$ ; flush(2);  $z^3 := x$ . Instead, one must also add a superfluous *CPU operation* and write e.g.  $x\coloneqq y^2; \overline{\mathtt{flush}(2)}; a\coloneqq w; z^3\coloneqq x.$  This is because flush(2) blocks the CPU operation  $a := w$ , which in turn blocks  $z^3 := x$ .

Second, RDMA<sup>TSO</sup> preserves the order between two NIC local reads and two NIC local writes (cells F6 and I9 in Fig. [9\)](#page-18-0), while coreRMA does not, violating the RDMA specification [\[IBTA](#page-25-2) [2022\]](#page-25-2). For instance, given  $y^2 \coloneqq x; z^2 \coloneqq w$ ,  $\text{\sc rbm{M}}^{\rm \sc ro}$  guarantees  $x$  is read before  $w$ , while there is no such guarantee under coreRMA; i.e. in such scenarios coreRMA is weaker than the specification.

Finally, as per the RDMA specification, under RDMA<sup>TSO</sup> remote reads can be reordered with respect to other remote operations on the same queue pair (cells H7 and H8), while under coreRMA they cannot; i.e. in such scenarios coreRMA is stronger than the specification. Consequently, as coreRMA is weaker than the specification in some scenarios and stronger in others, it is neither a strict abstraction, nor a strict refinement of the specification.

Weak Memory Models. Existing literature includes several examples of weak consistency models, both at hardware and software levels. On the hardware side, several works have formalised the semantics of the x86 architecture [\[Abdulla et al.](#page-24-10) [2015;](#page-24-10) [Alglave et al.](#page-24-4) [2014;](#page-24-4) [Raad et al.](#page-26-6) [2022;](#page-26-6) [Sewell](#page-26-11) [et al.](#page-26-11) [2010\]](#page-26-11). However, none of these works covered the consistency semantics of RDMA programs in the context of x86 machines. Similarly, several works have formalised the semantics of the ARMv8 and POWER architectures, both operationally and declaratively [\[Alglave et al.](#page-24-3) [2021,](#page-24-3) [2014;](#page-24-4) [Chakraborty and Vafeiadis](#page-24-11) [2019;](#page-24-11) [Flur et al.](#page-25-14) [2016;](#page-25-14) [Mador-Haim et al.](#page-25-9) [2012;](#page-25-9) [Pulte et al.](#page-26-4) [2018;](#page-26-4) [Sarkar](#page-26-10) [et al.](#page-26-10) [2011\]](#page-26-10). On the software side, there has been a number of formal models for C11 consistency [\[Batty et al.](#page-24-12) [2011;](#page-24-12) [Kang et al.](#page-25-15) [2017;](#page-25-15) [Lahav et al.](#page-25-16) [2016,](#page-25-16) [2017;](#page-25-10) [Lee et al.](#page-25-17) [2020;](#page-25-17) [Nienhuis et al.](#page-25-18) [2016;](#page-25-18) [Pichon-Pharabod and Sewell](#page-26-13) [2016\]](#page-26-13) with verified compilation schemes [\[Moiseenko et al.](#page-25-19) [2020;](#page-25-19) [Podkopaev et al.](#page-26-14) [2017,](#page-26-14) [2019\]](#page-26-15), Java [\[Bender and Palsberg](#page-24-13) [2019;](#page-24-13) [Manson et al.](#page-25-20) [2005\]](#page-25-20), transactional memory [\[Raad et al.](#page-26-16) [2018,](#page-26-16) [2019a;](#page-26-17) [Xiong et al.](#page-26-18) [2020\]](#page-26-18), the Linux kernel [\[Alglave et al.](#page-24-14) [2018\]](#page-24-14) and the ext4 filesystem [\[Kokologiannakis et al.](#page-25-21) [2021\]](#page-25-21). Additionally, there has been several works on formalising the persistency semantics of programs in the context of non-volatile memory, describing the behaviour of programs in case of crashes [\[Cho et al.](#page-24-5) [2021;](#page-24-5) [Khyzha and Lahav](#page-25-22) [2021;](#page-25-22) [Raad and](#page-26-7) [Vafeiadis](#page-26-7) [2018;](#page-26-7) [Raad et al.](#page-26-8) [2020b,](#page-26-8) [2019b\]](#page-26-9), as well as program logics for verifying such programs [\[Bila et al.](#page-24-15) [2022;](#page-24-15) [Raad et al.](#page-26-19) [2020a\]](#page-26-19).

Future Work. We plan to build over our work presented here in several ways. First, we aim to adapt existing methods for automatically generating litmus tests (e.g. [\[Alglave et al.](#page-24-3) [2021,](#page-24-3) [2014\]](#page-24-4)) to RDMA programs by developing heuristics for automatically applying the inducement/stressing techniques discussed in [§5](#page-20-0) for bringing about weak behaviours. Second, we plan to formalise the semantics of RDMA programs in the context of the ARMv8 hardware architecture (i.e. when CPU concurrency is governed by ARMv8 rather than TSO). Third, we plan to verify existing RDMA implementations such as those of the Verbs [\[linux-rdma](#page-25-4) [2018\]](#page-25-4) and libfabric [\[OpenFabrics](#page-25-5) [2016\]](#page-25-5) APIs. Lastly, using our formal RDMA<sup>TSO</sup> semantics, we plan to develop *manual* reasoning techniques such as program logics (underpinned by our operational semantics), as well as automated verification techniques such as model checking (based on our declarative semantics) for RDMA.

### Acknowledgments

We would like to thank Alexey Gotsman, Adam Morrison, Noam Rinetzky, and especially Yuri Meshman, for starting this research and whose preliminary memory model underpins the results presented here. Additional thanks to Yamin Friedman, Daniel Marcovitch and Liran Liss for sharing their insights into the IBTA specifications and NVIDIA's RDMA implementations. We also thank the anonymous reviewers for their valuable feedback and Viktor Vafeiadis for many fruitful discussions. Guillaume Ambal is supported by the EPSRC grant EP/X037029/1. Brijesh Dongol is supported by VeTSS and EPSRC projects EP/Y036425/1, EP/X037142/1, EP/X015149/1, EP/V038915/1, and EP/R025134/2. Ori Lahav is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 851811) and the Israel Science Foundation (grant number 814/22). Azalea Raad is supported by a UKRI fellowship MR/V024299/1, by the EPSRC grant EP/X037029/1, and by VeTSS.

### References

- <span id="page-24-6"></span>Parosh Aziz Abdulla, Mohamed Faouzi Atig, Ahmed Bouajjani, K. Narayan Kumar, and Prakash Saivasan. 2021. Deciding Reachability under Persistent X86-TSO. Proc. ACM Program. Lang. 5, POPL, Article 56 (Jan. 2021), 32 pages. [https:](https://doi.org/10.1145/3434337) [//doi.org/10.1145/3434337](https://doi.org/10.1145/3434337)
- <span id="page-24-10"></span>Parosh Aziz Abdulla, Mohamed Faouzi Atig, and Tuan-Phong Ngo. 2015. The Best of Both Worlds: Trading Efficiency and Optimality in Fence Insertion for TSO. In Proceedings of the 24th European Symposium on Programming on Programming Languages and Systems - Volume 9032. Springer-Verlag New York, Inc., New York, NY, USA, 308–332. [https://doi.org/10.](https://doi.org/10.1007/978-3-662-46669-8_13) [1007/978-3-662-46669-8\\_13](https://doi.org/10.1007/978-3-662-46669-8_13)
- <span id="page-24-2"></span>Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, and Igor Zablotchi. 2019. The Impact of RDMA on Agreement. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, PODC 2019, Toronto, ON, Canada, July 29 - August 2, 2019, Peter Robinson and Faith Ellen (Eds.). ACM, 409–418. [https:](https://doi.org/10.1145/3293611.3331601) [//doi.org/10.1145/3293611.3331601](https://doi.org/10.1145/3293611.3331601)
- <span id="page-24-3"></span>Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2021. Armed Cats: Formal Concurrency Modelling at Arm. ACM Trans. Program. Lang. Syst. 43, 2 (2021), 8:1–8:54. <https://doi.org/10.1145/3458926>
- <span id="page-24-14"></span>Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan Stern. 2018. Frightening Small Children and Disconcerting Grown-Ups: Concurrency in the Linux Kernel. SIGPLAN Not. 53, 2 (March 2018), 405–418. [https:](https://doi.org/10.1145/3296957.3177156) [//doi.org/10.1145/3296957.3177156](https://doi.org/10.1145/3296957.3177156)
- <span id="page-24-4"></span>Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2 (2014), 7:1–7:74. <https://doi.org/10.1145/2627752>
- <span id="page-24-8"></span>Guillaume Ambal, Brijesh Dongol, Haggai Eran, Vasileios Klimis, Ori Lahav, and Azalea Raad. 2024a. Extended Version. <https://www.soundandcomplete.org/papers/OOPSLA2024/RDMA/rdma-extended.pdf>
- <span id="page-24-9"></span>Guillaume Ambal, Brijesh Dongol, Haggai Eran, Vasileios Klimis, Ori Lahav, and Azalea Raad. 2024b. Project page for Semantics of Remote Direct Memory Access. <https://www.soundandcomplete.org/papers/OOPSLA2024/RDMA>
- <span id="page-24-1"></span>Don Anderson. 1999. FireWire system architecture (2nd ed.): IEEE 1394a. Addison-Wesley Longman Publishing Co., Inc., USA.
- <span id="page-24-12"></span>Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL '11). ACM, New York, NY, USA, 55–66. <https://doi.org/10.1145/1926385.1926394>
- <span id="page-24-13"></span>John Bender and Jens Palsberg. 2019. A Formalization of Java's Concurrent Access Modes. Proc. ACM Program. Lang. 3, OOPSLA, Article 142 (Oct. 2019), 28 pages. <https://doi.org/10.1145/3360568>
- <span id="page-24-15"></span>Eleni Vafeiadi Bila, Brijesh Dongol, Ori Lahav, Azalea Raad, and John Wickerson. 2022. View-Based Owicki–Gries Reasoning for Persistent x86-TSO. In Programming Languages and Systems, Ilya Sergey (Ed.). Springer International Publishing, Cham, 234–261.
- <span id="page-24-0"></span>M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett, T. Rimmer, K. D. Underwood, and R. C. Zak. 2015. Intel Omni-Path Architecture: Enabling scalable, high performance fabrics. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI) (HOTI 2015). 1–9. <https://doi.org/10.1109/HOTI.2015.22>
- <span id="page-24-7"></span>Ahmed Bouajjani, Egor Derevenetc, and Roland Meyer. 2013. Checking and Enforcing Robustness against TSO. In ESOP 2013 (LNCS, Vol. 7792). Springer, 533–553. [https://doi.org/10.1007/978-3-642-37036-6\\_29](https://doi.org/10.1007/978-3-642-37036-6_29)
- <span id="page-24-11"></span>Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding Thin-Air Reads with Event Structures. Proc. ACM Program. Lang. 3, POPL, Article 70 (Jan. 2019), 28 pages. <https://doi.org/10.1145/3290383>
- <span id="page-24-5"></span>Kyeongmin Cho, Sung-Hwan Lee, Azalea Raad, and Jeehoon Kang. 2021. Revamping Hardware Persistency Models: View-Based and Axiomatic Persistency Models for Intel-X86 and Armv8. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (Virtual, Canada) (PLDI 2021). Association for Computing Machinery, New York, NY, USA, 16–31. <https://doi.org/10.1145/3453483.3454027>
- <span id="page-25-8"></span>Andrei Marian Dan, Patrick Lam, Torsten Hoefler, and Martin Vechev. 2016. Modeling and Analysis of Remote Memory Access Programming. SIGPLAN Not. 51, 10 (oct 2016), 129–144. <https://doi.org/10.1145/3022671.2984033>
- <span id="page-25-1"></span>D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A.M. Merritt, E. Gronke, and C. Dodd. 1998. The Virtual Interface Architecture. IEEE Micro 18, 2 (1998), 66–76. <https://doi.org/10.1109/40.671404>
- <span id="page-25-14"></span>Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL '16). Association for Computing Machinery, New York, NY, USA, 608–621. <https://doi.org/10.1145/2837614.2837615>
- <span id="page-25-0"></span>Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. 2018. Enabling Highly Scalable Remote Memory Access Programming with MPI-3 One Sided. Commun. ACM 61, 10 (sep 2018), 106–113. <https://doi.org/10.1145/3264413>
- <span id="page-25-2"></span>IBTA. 2022. InfiniBand Architecture Specification Volume 1 Release 1.6. [https://www.infinibandta.org/ibta-specification/.](https://www.infinibandta.org/ibta-specification/)
- <span id="page-25-3"></span>InfiniBand Trade Association (IBTA). 2018. The RoCE Initiative. <https://www.infinibandta.org/roce-initiative/> (Accessed: July 2023).
- <span id="page-25-15"></span>Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A Promising Semantics for Relaxed-Memory Concurrency. SIGPLAN Not. 52, 1 (Jan. 2017), 175–189. <https://doi.org/10.1145/3093333.3009850>
- <span id="page-25-22"></span>Artem Khyzha and Ori Lahav. 2021. Taming X86-TSO Persistency. Proc. ACM Program. Lang. 5, POPL, Article 47 (Jan. 2021), 29 pages. <https://doi.org/10.1145/3434328>
- <span id="page-25-21"></span>Michalis Kokologiannakis, Ilya Kaysin, Azalea Raad, and Viktor Vafeiadis. 2021. PerSeVerE: Persistency Semantics for Verification under Ext4. Proc. ACM Program. Lang. 5, POPL, Article 43 (jan 2021), 29 pages. <https://doi.org/10.1145/3434324>
- <span id="page-25-12"></span>Ori Lahav and Udi Boker. 2020. Decidable verification under a causally consistent shared memory. In PLDI 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 211–226. <https://doi.org/10.1145/3385412.3385966>
- <span id="page-25-16"></span>Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. 2016. Taming Release-Acquire Consistency. SIGPLAN Not. 51, 1 (Jan. 2016), 649–662. <https://doi.org/10.1145/2914770.2837643>
- <span id="page-25-10"></span>Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing Sequential Consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (Barcelona, Spain) (PLDI 2017). Association for Computing Machinery, New York, NY, USA, 618–632. [https://doi.org/10.](https://doi.org/10.1145/3062341.3062352) [1145/3062341.3062352](https://doi.org/10.1145/3062341.3062352)
- <span id="page-25-11"></span>Leslie Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Computers 28, 9 (Sept. 1979), 690–691. <https://doi.org/10.1109/TC.1979.1675439>
- <span id="page-25-17"></span>Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: Global Optimizations in Relaxed Memory Concurrency. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 362–376. <https://doi.org/10.1145/3385412.3386010>
- <span id="page-25-4"></span>linux-rdma. 2018. RDMA core. <https://github.com/linux-rdma/rdma-core/> (Accessed: Jul. 2023).
- <span id="page-25-9"></span>Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek Williams. 2012. An Axiomatic Memory Model for POWER Multiprocessors. In Computer Aided Verification - 24th International Conference, CAV 2012, Berkeley, CA, USA, July 7-13, 2012 Proceedings (Lecture Notes in Computer Science, Vol. 7358), P. Madhusudan and Sanjit A. Seshia (Eds.). Springer, 495–512. [https://doi.org/10.1007/978-](https://doi.org/10.1007/978-3-642-31424-7_36) [3-642-31424-7\\_36](https://doi.org/10.1007/978-3-642-31424-7_36)
- <span id="page-25-20"></span>Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Long Beach, California, USA) (POPL '05). Association for Computing Machinery, New York, NY, USA, 378–391. <https://doi.org/10.1145/1040305.1040336>
- <span id="page-25-19"></span>Evgenii Moiseenko, Anton Podkopaev, Ori Lahav, Orestis Melkonian, and Viktor Vafeiadis. 2020. Reconciling Event Structures with Modern Multiprocessors (Artifact). Dagstuhl Artifacts Series 6, 2 (2020), 4:1–4:3. <https://doi.org/10.4230/DARTS.6.2.4>
- <span id="page-25-18"></span>Kyndylan Nienhuis, Kayvan Memarian, and Peter Sewell. 2016. An Operational Semantics for C/C++11 Concurrency. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (Amsterdam, Netherlands) (OOPSLA 2016). Association for Computing Machinery, New York, NY, USA, 111–128. <https://doi.org/10.1145/2983990.2983997>
- <span id="page-25-6"></span>Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi, and Boris Grot. 2014. Scale-out NUMA. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (Salt Lake City, Utah, USA) (ASPLOS '14). Association for Computing Machinery, New York, NY, USA, 3–18. [https://doi.org/](https://doi.org/10.1145/2541940.2541965) [10.1145/2541940.2541965](https://doi.org/10.1145/2541940.2541965)
- <span id="page-25-7"></span>NVIDIA Corporation. 2021. NVIDIA BlueField-2 DPU. [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf)[Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf) (Accessed: Jul. 2023).
- <span id="page-25-5"></span>OpenFabrics. 2016. RDMA core. <https://ofiwg.github.io/libfabric/> (Accessed: Jul. 2023).

<span id="page-25-13"></span>PCI-SIG. 2022. PCI Express Base Specification Revision 6.0 Version 1.0. [https://pcisig.com/pci-express-6.0-specification.](https://pcisig.com/pci-express-6.0-specification)

- <span id="page-26-13"></span>Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-Air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL '16). Association for Computing Machinery, New York, NY, USA, 622–633. <https://doi.org/10.1145/2837614.2837616>
- <span id="page-26-14"></span>Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2017. Promising Compilation to ARMv8 POP. In 31st European Conference on Object-Oriented Programming (ECOOP 2017) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 74), Peter Müller (Ed.). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 22:1–22:28. [https:](https://doi.org/10.4230/LIPIcs.ECOOP.2017.22) [//doi.org/10.4230/LIPIcs.ECOOP.2017.22](https://doi.org/10.4230/LIPIcs.ECOOP.2017.22)
- <span id="page-26-15"></span>Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the Gap Between Programming Languages and Hardware Weak Memory Models. Proc. ACM Program. Lang. 3, POPL, Article 69 (Jan. 2019), 31 pages. <https://doi.org/10.1145/3290382>
- <span id="page-26-4"></span>Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8. Proc. ACM Program. Lang. 2, POPL, Article 19 (Dec. 2018), 29 pages. <https://doi.org/10.1145/3158107>
- <span id="page-26-5"></span>Christopher Pulte, Jean Pichon-Pharabod, Jeehoon Kang, Sung-Hwan Lee, and Chung-Kil Hur. 2019. Promising-ARM/RISC-V: A Simpler and Faster Operational Concurrency Model. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 1–15. <https://doi.org/10.1145/3314221.3314624>
- <span id="page-26-16"></span>Azalea Raad, Ori Lahav, and Viktor Vafeiadis. 2018. On Parallel Snapshot Isolation and Release/Acquire Consistency. In Programming Languages and Systems, Amal Ahmed (Ed.). Springer International Publishing, Cham, 940–967.
- <span id="page-26-17"></span>Azalea Raad, Ori Lahav, and Viktor Vafeiadis. 2019a. On the Semantics of Snapshot Isolation. In Verification, Model Checking, and Abstract Interpretation, Constantin Enea and Ruzica Piskac (Eds.). Springer International Publishing, Cham, 1–23.
- <span id="page-26-19"></span>Azalea Raad, Ori Lahav, and Viktor Vafeiadis. 2020a. Persistent Owicki-Gries Reasoning: A Program Logic for Reasoning about Persistent Programs on Intel-X86. Proc. ACM Program. Lang. 4, OOPSLA, Article 151 (nov 2020), 28 pages. <https://doi.org/10.1145/3428219>
- <span id="page-26-6"></span>Azalea Raad, Luc Maranget, and Viktor Vafeiadis. 2022. Extending Intel-X86 Consistency and Persistency: Formalising the Semantics of Intel-X86 Memory Types and Non-Temporal Stores. Proc. ACM Program. Lang. 6, POPL, Article 22 (jan 2022), 31 pages. <https://doi.org/10.1145/3498683>
- <span id="page-26-7"></span>Azalea Raad and Viktor Vafeiadis. 2018. Persistence Semantics for Weak Memory: Integrating Epoch Persistency with the TSO Memory Model. Proc. ACM Program. Lang. 2, OOPSLA, Article 137 (Oct. 2018), 27 pages. <https://doi.org/10.1145/3276507>
- <span id="page-26-8"></span>Azalea Raad, John Wickerson, Gil Neiger, and Viktor Vafeiadis. 2020b. Persistency Semantics of the Intel-X86 Architecture. Proc. ACM Program. Lang. 4, POPL, Article 11 (Dec. 2020), 31 pages. <https://doi.org/10.1145/3371079>
- <span id="page-26-9"></span>Azalea Raad, John Wickerson, and Viktor Vafeiadis. 2019b. Weak Persistency Semantics from the Ground Up: Formalising the Persistency Semantics of ARMv8 and Transactional Models. Proc. ACM Program. Lang. 3, OOPSLA, Article 135 (Oct. 2019), 27 pages. <https://doi.org/10.1145/3360561>
- <span id="page-26-1"></span>Renato J. Recio, Paul R. Culley, Dave Garcia, Bernard Metzler, and Jeff Hilland. 2007. A Remote Direct Memory Access Protocol Specification. RFC 5040. <https://doi.org/10.17487/RFC5040>
- <span id="page-26-10"></span>Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 2011. Understanding POWER Multiprocessors. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (San Jose, California, USA) (PLDI '11). Association for Computing Machinery, New York, NY, USA, 175–186. [https://doi.org/10.](https://doi.org/10.1145/1993498.1993520) [1145/1993498.1993520](https://doi.org/10.1145/1993498.1993520)
- <span id="page-26-11"></span>Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. 2010. X86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors. Commun. ACM 53, 7 (July 2010), 89–97. [https://doi.org/10.](https://doi.org/10.1145/1785414.1785443) [1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443)
- <span id="page-26-0"></span>Alexander Shpiner, Eitan Zahavi, Omar Dahley, Aviv Barnea, Rotem Damsker, Gennady Yekelis, Michael Zus, Eitan Kuta, and Dean Baram. 2017. RoCE Rocks without PFC: Detailed Evaluation. In Proceedings of the Workshop on Kernel-Bypass Networks (Los Angeles, CA, USA) (KBNets '17). Association for Computing Machinery, New York, NY, USA, 25–30. <https://doi.org/10.1145/3098583.3098588>
- <span id="page-26-12"></span>SPARC. 1992. The SPARC Architecture Manual: Version 8. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
- <span id="page-26-2"></span>S. Van Doren. 2019. Abstract - HOTI 2019: Compute Express Link. In 2019 IEEE Symposium on High-Performance Interconnects (HOTI) (HOTI 2019). 18–18. <https://doi.org/10.1109/HOTI.2019.00017>
- <span id="page-26-3"></span>Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-Memory Transaction Processing Using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California) (SOSP '15). Association for Computing Machinery, New York, NY, USA, 87–104. <https://doi.org/10.1145/2815400.2815419>
- <span id="page-26-18"></span>Shale Xiong, Andrea Cerone, Azalea Raad, and Philippa Gardner. 2020. Data Consistency in Transactional Storage Systems: A Centralised Semantics. In 34th European Conference on Object-Oriented Programming (ECOOP 2020) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 166), Robert Hirschfeld and Tobias Pape (Eds.). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 21:1–21:31. <https://doi.org/10.4230/LIPIcs.ECOOP.2020.21>

<span id="page-27-0"></span>

<span id="page-27-1"></span>Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (London, United Kingdom) (SIGCOMM '15). Association for Computing Machinery, New York, NY, USA, 523–536. [https://doi.org/10.1145/2785956.](https://doi.org/10.1145/2785956.2787484) [2787484](https://doi.org/10.1145/2785956.2787484)

Received 2024-04-05; accepted 2024-08-18