Perfectly Oblivious (Parallel) RAM Revisited, and Improved Constructions*

T-H. Hubert Chan†
HKU
hubert@cs.hku.hk

Elaine Shi‡
CMU
runting@cs.cmu.edu

Wei-Kai Lin§
Cornell
wklin@cs.cornell.edu

Kartik Nayak¶
Duke
kartik@cs.duke.edu

Abstract

Oblivious RAM (ORAM) is a technique for compiling any RAM program to an oblivious counterpart, i.e., one whose access patterns do not leak information about the secret inputs. Similarly, Oblivious Parallel RAM (OPRAM) compiles a parallel RAM program to an oblivious counterpart. In this paper, we care about ORAM/OPRAM with perfect security, i.e., the access patterns must be identically distributed no matter what the program’s memory request sequence is. In the past, two types of perfect ORAMs/OPRAMs have been considered: constructions whose performance bounds hold in expectation (but may occasionally run more slowly); and constructions whose performance bounds hold deterministically (even though the algorithms themselves are randomized).

In this paper, we revisit the performance metrics for perfect ORAM/OPRAM, and show novel constructions that achieve asymptotical improvements for all performance metrics. Our first result is a new perfectly secure OPRAM scheme with $O(\log^3 N / \log \log N)$ expected overhead. In comparison, prior literature has been stuck at $O(\log^2 N)$ for more than a decade.

Next, we show how to construct a perfect ORAM with $O(\log^3 N / \log \log N)$ deterministic simulation overhead. We further show how to make the scheme parallel, resulting in an perfect OPRAM with $O(\log^4 N / \log \log N)$ deterministic simulation overhead. For perfect ORAMs/OPRAMs with deterministic performance bounds, our results achieve subexponential improvement over the state-of-the-art. Specifically, the best known prior scheme incurs more than $\sqrt{N}$ deterministic simulation overhead (Raskin and Simkin, Asiacrypt’19); moreover, their scheme works only for the sequential setting and is not amenable to parallelization.

Finally, we additionally consider perfect ORAMs/OPRAMs whose performance bounds hold with high probability. For this new performance metric, we show new constructions whose simulation overhead is upper bounded by $O(\log^3 / \log \log N)$ except with negligible in $N$ probability, i.e., we prove high-probability performance bounds that match the expected bounds mentioned earlier.

*Author ordering is randomized.
†T-H. Hubert Chan was partially supported by the Hong Kong RGC under the grants 17200418 and 17201220.
‡Elaine Shi was partially supported by NSF CNS-1601879, an ONR YIP award, and a Packard Fellowship.
§Wei-Kai Lin was supported by a DARPA Brandeis award.
¶Kartik Nayak was partially supported by NSF Award 2016393.
1 Introduction

Oblivious RAM (ORAM) is an algorithmic construction that provably obfuscates a (parallel) program’s access patterns. It was first proposed in the ground-breaking work by Goldreich and Ostrovsky [GO96, Gol87], and its parallel counterpart Oblivious Parallel ORAM (OPRAM) was proposed by Boyle et al. [BCP16]. ORAM and OPRAM are fundamental building blocks for enabling various forms of secure computation on sensitive data, e.g., either through trusted-hardware [RYF+13, FRY+14, MLS+13, LHM+15] or relying on cryptographic multi-party computation [GKK+12, LWN+15]. Since the initial proposal, ORAM and OPRAM have attracted much interest from various communities, and there has been a line of work dedicated to understanding their asymptotic and concrete efficiencies. It is well-known [GO96, Gol87, LN18] that any O(P)RAM scheme must incur at least a logarithmic overhead (also known as simulation overhead) in (parallel) runtime relative to the insecure counterpart. On the other hand, ORAM/OPRAM schemes with poly-logarithmic overhead have been known [GO96, Gol87, GM11, KLO12, SCSL11, SvDS+13, WCS15, PPRY18], and the very recent exciting work of Asharov et al. [AKL+20] showed how to match the logarithmic lower bound in the sequential ORAM setting, assuming the existence of one-way functions and a computationally bounded adversary. Throughout this paper, we use the standard notion of simulation overhead originally suggested by Goldreich and Ostrovsky [GO96, Gol87]: if the original RAM/PRAM’s (parallel) runtime is $T$ and the corresponding ORAM/OPRAM’s (parallel) runtime is $\chi T$, we say that the ORAM/OPRAM has simulation overhead $\chi$.

Motivation for perfectly secure ORAMs/OPRAMs. With the exception of very few works, most of the literature has focused on either computationally secure [GO96, Gol87, GM11, KLO12, CGLS17, PPRY18, AKL+20] or statistically secure [Ajt10, SCSL11, SvDS+13, CLP14, WCS15] ORAMs. A computationally secure (or statistically secure, resp.) ORAM guarantees that for any two request sequences of the same length, the access patterns incurred are computationally (or statistically resp.) indistinguishable. Most known computationally secure or statistically secure schemes [GO96, Gol87, SCSL11, SvDS+13, WCS15, BCP16, CS17] suffer from a small failure probability that is negligible in the ORAM’s size henceforth denoted $N$ while achieving poly log $N$ overhead. If the ORAM/OPRAM’s size is large, say, $N \geq \lambda$ for some desired security parameter $\lambda$, then the failure probability would also be negligible in the security parameter. Unfortunately, for small choices of $N$ (e.g., $N = \text{poly} \log \lambda$), these schemes actually give polylogarithmic overhead in security parameter $\lambda$ (and not in $N$) to achieve a negl($\lambda$) security failure probability — note that a poly log $\lambda$ overhead equals to $N^{\Theta(1)}$ for this parameter regime, and thus the dependence on $N$ is undesirable. Even though at first sight, it seems like we might not care about the parameter regime when $N$ is much smaller than $\lambda$; as it turns out, such a small-N ORAM/OPRAM (with polylogarithmic in $N$ overhead) was needed in many scenarios, such as in the construction of searchable encryption schemes [DPP18], oblivious algorithms [SCSL11, PPRY18, ACN+19], including notably, the recent OptORAMa work [AKL+20] that constructed an optimal ORAM.

The study of perfectly secure ORAMs/OPRAMs is partly motivated by the aforementioned mismatch. Moreover, recall also that perfect security has long been a topic of interest in the multi-party computation and zero-knowledge proof literature [IKO+11, GIW16], and its theoretical importance widely-accepted. Historically, perfect security is viewed as attractive since 1) the security holds in any computational model even if quantum computers or other forms of computers can be built; and 2) perfectly secure schemes often have clean compositional properties. Therefore, another natural application of perfectly secure ORAM/OPRAM is to construct efficient perfectly secure, RAM-model MPC.

\footnote{For the parallel setting, how to achieve optimality remains open.}
Does there exist a perfectly secure ORAM/OPRAM with \(o(\log^3 N)\) overhead? Despite the sustained and lively progress in understanding the asymptotic overhead of computationally and statistically secure ORAMs/OPRAMs, our understanding of perfectly secure ORAMs/OPRAMs has been somewhat stuck. In general, few results are known in the perfect security regime: in 2011, Damgård et al. [DMN11] first showed a perfectly secure ORAM scheme with \(O(\log^3 N)\) expected simulation overhead and \(O(\log N)\) server space blowup, where the space blowup is the ratio between the server space consumed by the scheme compared to the insecure space \(N\). Recently, Chan et al. [CNS18] show an improved and simplified construction that removed the \(\log N\) server space blowup; and moreover, they showed how to extend the approach to the parallel setting resulting in a perfectly secure OPRAM scheme with \(O(\log^3 N)\) expected overhead. There is no known super-logarithmic lower bound for perfect security, and thus we do not understand yet whether the requirement of perfect security would inherently incur more overhead than computationally secure ORAMs. Therefore, an exciting and extremely challenging open direction is to understand the exact asymptotic complexity of perfectly secure ORAMs and OPRAMs, that is, to seek a matching upper- and lower-bound. This is a very ambitious goal and in this paper, we aim to take the next natural step forward. Since all prior upper bounds seem stuck at \(O(\log^3 N)\), we ask the following natural question: does there exist an ORAM/OPRAM with \(o(\log^3 N)\) asymptotic overhead?

The large gap between expected and deterministic performance bounds. To achieve perfect security, the prior perfect ORAM/OPRAM constructions pay a price: specifically their algorithms are Las Vegas, and the stated \(O(\log^3 N)\) overhead is in an expected sense. Their ORAMs can occasionally run longer than \(O(\log^3 N)\) time if certain unlucky events happen (where the unlucky events are identically distributed for all inputs so that the scheme remains perfectly secure). Moreover, the smaller the choice of \(N\), the more likely that the ORAM can run much longer than the expectation.

Raskin et al. [RS19] (Asiacrypt’19) recently pointed out that this issue was somewhat shoved under the rug in prior works on perfect ORAMs/OPRAMs, and they were the first to ask how to construct perfect ORAMs with deterministic performance bounds. To avoid confusion, note that all ORAM schemes with non-trivial efficiency must be randomized algorithms; however, their performance bounds can be made deterministic (i.e., deterministic performance bounds does not mean that the algorithm is deterministic). Raskin et al. showed a perfectly secure ORAM with a deterministic simulation overhead \(O(\sqrt{N} \log^N N / \log \log N)\) (assuming \(O(1)\) client-side storage).\(^2\) While conceptually interesting, in comparison with the \(O(\log^3 N)\) schemes [DMN11][CNS18], the price to obtain deterministic bounds seems high. Therefore, another natural question is, does there exist perfectly secure ORAMs/OPRAMs with deterministic polylogarithmic overhead?

1.1 Our Results and Contributions

Our contributions are two-fold. First, following Raskin et al., we make another effort at systematizing the performance metrics for perfect ORAM/OPRAMs. Besides expected and deterministic performance bounds, we additionally consider the notion of high-probability performance bounds (explained below). Second, we show novel perfect ORAM/OPRAM constructions with asymptotical performance improvements for all three types of performance metrics: expected, high-probability, and deterministic.

Revisiting the performance metrics for perfectly secure ORAMs/OPRAMs. We consider the following performance bounds for ORAM/OPRAMs:

\(^2\)Their overhead can be improved to \(O(\sqrt{N})\) if we allowed \(O(\sqrt{N})\) client-side storage.
1. **Expected performance bounds.** Suppose that the original RAM/PRAM runs in (parallel) time $T$, and the corresponding ORAM/OPRAM runs in expected (parallel) time $\chi \cdot T$, then we say that the ORAM/OPRAM satisfies expected simulation overhead (or expected overhead) $\chi$.

2. **High-probability performance bounds.** Suppose that the original RAM/PRAM runs in (parallel) time $T$, and the corresponding ORAM/OPRAM runs in (parallel) time $\chi \cdot T$ with $1 - \delta$ probability where $\delta$ is suitably small (e.g., negligibly small in some security parameter), then, we say that ORAM/OPRAM satisfies simulation overhead (or overhead) $\chi$ with probability $1 - \delta$. We stress that the failure probability $\delta$ describes the probability that the ORAM/OPRAM exceeds the performance bounds, it does not describe security failure since the ORAM/OPRAM is perfectly secure. This is a natural, intermediate notion that is not as stringent as deterministic performance bounds and yet gives a strong guarantee. This notion may permit asymptotically better schemes than insisting on deterministic performance bounds.

3. **Deterministic performance bounds.** Suppose that the original RAM/PRAM runs in (parallel) time $T$, and the corresponding ORAM/OPRAM runs in (parallel) time $\chi \cdot T$ with probability 1, then we say that the ORAM/OPRAM satisfies deterministic simulation overhead $\chi$.

### Asymptotically better constructions for all performance metrics.

We show novel perfect ORAM/OPRAM constructions that achieve asymptotical performance improvements across the board.

- First, for expected performance, we overcome the $\log^3 N$ barrier that the literature has been stuck at for the past decade. Our perfect ORAM/OPRAM scheme has $O(\log^3 N/\log \log N)$ expected overhead.
- Second, for high-probability performance bounds, previously, there were no documented schemes with this type of performance bounds to the best of our knowledge. We show new perfectly secure ORAM/OPRAMs that achieve $O(\log^3 N/\log \log N)$ simulation overhead with probability $1 - \text{negl}(N)$.
- Finally, we construct perfect ORAM/OPRAMs with deterministic, poly-logarithmic simulation overhead. Our result achieves a sub-exponential performance improvement relative to prior art [RS19].

Our results are summarized in the following theorems, and moreover, Table 1 gives an explicit comparison of our results with prior work.

**Theorem 1.1** (Informal: perfect OPRAM with expected performance bounds). *There exists a perfectly secure OPRAM scheme that consumes only $O(1)$ blocks of client private cache and $O(N)$ blocks of server-space; moreover the scheme achieves $O(\log^3 N/\log \log N)$ expected simulation overhead.*

**Theorem 1.2** (Informal: perfect OPRAM with high-probability performance bounds). *There exists a perfectly secure OPRAM scheme that consumes only $O(1)$ blocks of client private cache and $O(N)$ blocks of server-space; moreover the scheme achieves $O\left(\frac{1}{\log \log N} \cdot (\log^3 N + \log^2 N \cdot \log^2 \log(1/\delta) + \log N \cdot \log^3 \log(1/\delta))\right)$ simulation overhead with probability $1 - \delta$. 

---

$^3$Note that our formal theorem statement gives a parametrized version where the performance failure probability may be any free parameter.
Table 1: Comparison of our results with prior work. For simplicity, the high-probability bounds are parameterized to $1 - \text{negl}(N)$ failure probability.

<table>
<thead>
<tr>
<th>Schemes with expected performance bounds</th>
<th>Space</th>
<th>ORAM overhead</th>
<th>OPRAM overhead</th>
</tr>
</thead>
<tbody>
<tr>
<td>Damgård et al. [DMN11]</td>
<td>$O(N \log N)$</td>
<td>$O(\log^3 N)$</td>
<td>N/A</td>
</tr>
<tr>
<td>Chan et al. [CNS18]</td>
<td>$O(N)$</td>
<td>$O(\log^2 N)$</td>
<td>$O(\log^3 N)$</td>
</tr>
<tr>
<td>This work</td>
<td>$O(N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Schemes with high-probability performance bounds</th>
<th>Space</th>
<th>ORAM overhead</th>
<th>OPRAM overhead</th>
</tr>
</thead>
<tbody>
<tr>
<td>This work</td>
<td>$O(N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Schemes with deterministic performance bounds</th>
<th>Space</th>
<th>ORAM overhead</th>
<th>OPRAM overhead</th>
</tr>
</thead>
<tbody>
<tr>
<td>Raskin et al. [RS19]</td>
<td>$O(N)$</td>
<td>$O(\sqrt{N} \cdot \log N)$</td>
<td>N/A</td>
</tr>
<tr>
<td>This work</td>
<td>$O(N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
<td>$O(\log^3 N/\log \log N)$</td>
</tr>
</tbody>
</table>

**Theorem 1.3** (Informal: perfect OPRAM with deterministic performance bounds). There exists a perfectly secure ORAM scheme that achieves $O(\log^3 N/\log \log N)$ simulation overhead with probability 1. For the parallel setting: there exists a perfectly secure OPRAM scheme that achieves $O(\log^4 N/\log \log N)$ simulation overhead with probability 1.

For both the ORAM/OPRAM schemes above, we need only $O(1)$ blocks of client private cache and $O(N)$ blocks of server-space.

1.2 Technical Highlight

**Getting an $O(\log^3 N/\log \log N)$ deterministic overhead ORAM.** To improve the overhead of perfectly secure ORAMs to $O(\log^3 N/\log \log N)$, our techniques are inspired by the rebalancing trick of Kushilevitz et al. [KLO12] (SODA’12), and yet departs significantly from Kushilevitz et al. Namely, the existing perfect ORAM/OPRAM constructions of Chan et al. [CNS18] consist of an “online fetch phase” and an “offline maintain phase,” the fetch phase takes a logical request (as an input to the ORAM) and answers to the request using a data structure, and then the maintain phase reshuffles the data structure. We observe that the maintain and fetch phases suffer from an imbalance; specifically, the offline maintain phase costs $O(\log^3 N)$ per request whereas the online fetch phase costs only $O(\log^2 N)$. A natural idea is to modify the scheme and rebalance the costs of the offline maintain phase and the online fetch phase, such that both phases would cost only $O(\log^3 N/\log \log N)$. Unfortunately, existing techniques such as Kushilevitz et al. completely fail for rebalancing perfect ORAMs/OPRAMs — we describe the technical reasons in detail in Appendix A.

Our starting point is the perfect ORAM construction by Chan et al. [CNS18] in which the maintain phase costs $O(\log^3 N)$ and the fetch phase costs only $O(\log^2 N)$. Specifically, their construction consists of $D = O(\log N)$ number of ORAMs such that except for the last ORAM which stores the actual data blocks, every other ORAM serves as a (recursive) index structure into the next ORAM — for this reason, these $D$ ORAMs are also called $D$ recursion depths; and all of the recursion depths jointly realize an implicit logical index structure that is in fact isomorphic to a binary tree (which has a branching factor of 2).

First, we show how to use a fat-block trick to increase the branching factor and hence reduce the number of recursion depths by a $\log \log N$ factor. Specifically, we increase the implicit index structure’s branching factor from 2 (i.e., storing two pointers or position labels in the next recursion depth) to $\log N$. Thus a fat-block is a bundle of logarithmically many normal blocks and hence each fat-block can store logarithmically many pointers. While this reduces the recursion depth by a $\log \log N$ factor, the fetch phase now costs a logarithmic factor more per recursion depth (since
obliviously accessing a fat-block is a logarithmic factor more costly than accessing a normal block).

The primary challenge is to realize the maintain phase such that the amortized per-depth maintain-phase cost preserves the asymptotics, despite the fat-block now being logarithmically fatter. To accomplish this we rely on the following two key insights:

1. Exploit residual randomness. First, we rely on an elegant observation first made in the PanORAMa work [PPRY18] in the context of computationally secure ORAMs. Here we make the same observation for perfectly secure ORAMs. At the core of Chan et al.’s ORAM construction is a data structure called an oblivious “one-time-memory” (OTM). When an OTM is initialized, all elements in it are randomly permuted (and the randomness concealed from the adversary) — note that in our setting, each element is a fat-block. The critical observation is that after accessing a subset of the elements in this OTM data structure, the remaining unvisited elements still appear in a random order. By exploiting such residual randomness, when we would like to build a new OTM consisting of the remaining unvisited elements, we can avoid performing expensive oblivious permutation (which would take time $O(n \log n)$ to sort $n$ elements) and instead rely on linear-time operations.

2. Exploit sparsity. In the construction of Chan et al., the $D$ ORAMs at all recursion depths must perform a “coordinated shuffle” operation during the maintain phase. An important step in this coordinated shuffle is (for each recursion depth) to inform the parent depth the locations of its fat-blocks after the reshuffle. In Chan et al., two adjacent recursion depths perform such “communication” through oblivious sorting, thus incurring $O(n \log n)$ cost per-depth to rebuild a data structure of size $n$.

Our key observation is that the fat-blocks contained in each OTM data structure in each recursion depth are sparsely populated. In fact, most entries in the fat-blocks are irrelevant and only a $1/\log N$ fraction of them are populated. Thus, we employ an oblivious tight compaction [AKL+20a] to compress away the wasted space, where tight compaction is a degenerated sorting that sorts elements tagged with 0/1 keys. After this compression, the OTM becomes logarithmically smaller and we can apply an oblivious sort.

Finally, we stress that to get an ORAM with deterministic performance bounds, we will need to instantiate our ORAM with building blocks that give deterministic performance. We defer the details to later technical sections.

Parallelizing the scheme. To parallelize our ORAM scheme, we encounter several additional technicalities. Some of the core algorithmic building blocks can be parallelized; however, to preserve the asymptotical total work of the sequential versions, the only known parallel counterparts give Las Vegas-type performance bounds. This would be fine if we only wanted an OPRAM with expected $O(\log^3 N/ \log \log N)$ overhead. However, more work is needed to get an OPRAM whose simulation overhead is $O(\log^3 N/ \log \log N)$ with high probability. Finally, to get an OPRAM whose simulation overhead holds deterministically, we need to replace some of the oblivious parallel building blocks with ones with deterministic performance bounds — here we lose an extra logarithmic factor in total work. We defer a more detailed exposition of these technicalities to later sections.

2 Technical Overview

We start with an informal and intuitive exposition of our main technical ideas. For simplicity, most of the section describes how to get the sequential ORAM result with deterministic $O(\log^3 N/ \log \log N)$ simulation overhead. Then, in Section 2.4 we sketch the additional steps
and technicalities needed to parallelize the scheme to get the expected, high-probability, and deterministic performance bounds for OPRAM. The full formal details will be deferred to the technical sections later.

2.1 Background on Perfect ORAM

The goal of an ORAM scheme is to simulate to the client a memory array of $N$ blocks, where each block consists of $\Omega(\log N)$ bits. In the simulated memory, each block is indexed by a logically address in $\{0,1,\ldots,N-1\}$, and the client can read or write a block using a logical address. The ORAM is allowed to use client-side storage of only $O(1)$ number of blocks as well as server storage, where the server supports only fetch or store the content of blocks (but no computation). By perfect security, we require that the sequence of accessed blocks on the server (also called the access pattern) be identically distributed for any sequence of read/write accesses to the memory simulated by ORAM. Such settings are standard in previous works [DMN11,CNS18,RS19].

In a recent work, Chan et al. [CNS18] propose a perfectly secure ORAM with $O(\log^3 N)$ simulation overhead. At a high level, their scheme is inspired by the hierarchical ORAM paradigm by Goldreich and Ostrovsky [GO96,Gol87], but Chan et al. replace the “oblivious hashing” (which has a negligible statistical imperfectness) with perfectly secure data structures (including a “one-time-memory” as well as a “position map”). In this way, they also remove the pseudo-random function (PRF) in Goldreich and Ostrovsky’s construction [GO96,Gol87].

2.1.1 Position-based Hierarchical ORAM

First, imagine that the client can store per-block metadata and we will later remove this strong assumption through a non-blackbox recursion technique. Specifically, imagine that the client remembers where exactly each block is residing on the server. In this case, we can construct a perfect ORAM as follows — henceforth this building block is called “position-based ORAM” since we assume that the correct position label for every requested block is known to the client.

Hierarchical levels. The server-side data structure is organized into $\log N + 1$ levels numbered $0,1,\ldots,\log N$ where level $i$ is either 1. empty, in which case it stores no blocks; or 2. full, in which case the level stores $2^i$ real blocks plus $2^i$ dummy blocks in a randomly permuted fashion (we also say that the level has capacity $2^i$). Each block, whose logical addresses range from 0 to $N-1$, resides in exactly one of the levels at a random position within the level.

Fetch phase. Every time a block with address $\text{addr}$ is requested, the client looks up the block’s position. Suppose that the block resides in the $j$-th position of level $\ell$. The client now visits for one block per full level from the server — note that the levels are visited in a fixed order from 0 to $\log N$:

- for level $\ell$ (i.e., where the desired block resides), the client reads precisely the $j$-th position to fetch the real block; it then marks the visited position as visited;
- for every other level $\ell' \neq \ell$, the client reads a random unvisited dummy block (and marks the corresponding block on the server as visited for obliviousness).

Maintain phase. Every time a block $B$ has been fetched by the client, it updates the block $B$’s contents if this is a write request, and then it puts $B$ back to level 0 as an unvisited block so that level 0 becomes full. Now, suppose levels $0,1,\ldots,\ell^*$ are all full and either level $\ell^* + 1$ is empty or $\ell^* = \log N$. The client will now merge levels $0,1,\ldots,\ell^*$ into the “target” level $\ell_{tgt} :=$
\[ \min(\ell^* + 1, \log N) \]. This procedure is called “rebuilding” level \( \ell_{\text{tgt}} \). At the end of the rebuild, it marks level \( \ell_{\text{tgt}} \) as full and every smaller level as empty.

To merge consecutively full levels into the next empty level (or the largest level), the goal is to implement the following ideal functionality obliviously:

1. extract all unvisited real blocks to be merged and place them in an array called \( A \);
2. pad \( A \) with dummy blocks to a length of \( 2 \cdot 2^{\ell_{\text{tgt}}} \) blocks and randomly permute the resulting array.

Chan et al. show how to achieve the above obliviously — even though the client has only \( O(1) \) blocks of client storage — through oblivious sorting (using the AKS sorting network \([AKS83]\)). The cost of rebuilding a level of capacity \( n \) is dominated by the oblivious sorting on \( O(n) \) blocks, which has a cost of \( O(n \log n) \).

Note that the above construction guarantees that whenever a real block is accessed, it is moved into a smaller level. Thus, in every level, each real or dummy block is accessed at most once before the level is rebuilt; and this is important for obliviousness. For this reason, later in our technical sections, we name each level in this hierarchy an oblivious “one-time memory”. Note also that the number of dummies in a level must match the total number of accesses the level can support before it is rebuilt again.

**Additional details about dummy positions.** The above description implicitly assumed that for a level the desired block does not reside in, the client is informed of the position of a random unvisited dummy block. If the client does not store this information locally, it can construct a (per-level) metadata array \( M \) on the server every time a level is rebuilt. When a block is being requested, the client can sequentially scan the metadata array at every level (including the level where the desired block resides) to discover the location of the next unvisited dummy (residing at a random unvisited location in the level).

As Chan et al. show, such a dummy metadata array can be constructed with \( O(n \log n) \) overhead using oblivious sorting too, at the same time when a level of capacity \( n \) is rebuilt.

**Overhead.** Summarizing, in the position-based ORAM, after every \( 2^\ell \) requests, the level \( \ell \) will be rebuilt, paying \( O(2^\ell \cdot \log(2^\ell)) \) cost. Amortizing the total cost over the sequence of requests, it is not difficult to see that the average cost per request is \( O(\log^2 N) \).

### 2.1.2 Recursive Position Map

So far we have assumed that the client magically remembers where exactly each block is residing on the server. To remove this assumption, Chan et al. propose to recursively store the blocks’ position labels in smaller ORAMs until the ORAM’s size becomes constant, resulting in \( D = O(\log N) \) ORAMs henceforth denoted \( \text{ORAM}_0, \text{ORAM}_1, \ldots, \text{ORAM}_D \) respectively, where \( \text{ORAM}_i \) stores the position labels of all blocks in \( \text{ORAM}_{i+1} \) for \( i \in \{0,1,\ldots,D\} \). We often call \( \text{ORAM}_D \) the “data ORAM” and every other ORAM a “metadata ORAM”; we also refer to the index \( i \) as the depth of \( \text{ORAM}_i \). Now, suppose that each block can store \( \Omega(\log N) \) bits of information, such that we can pack the position labels of at least 2 blocks into a single block. In this case, each \( \text{ORAM}_i \) is twice smaller in capacity than \( \text{ORAM}_{i+1} \) and thus \( \text{ORAM}_0 \) would be of \( O(1) \) size — thus operations to \( \text{ORAM}_0 \) can be supported trivially by scanning through the whole \( \text{ORAM}_0 \) consuming only constant cost, and the total space is still \( O(N) \).

As Chan et al. show, in the hierarchical ORAM context such a recursion idea does not work
in a straightforward blackbox manner but needs a special “coordinated rebuild” technique which we now explain. Henceforth, suppose that each block’s logical address $\text{addr}$ is $\log N$ bits long, and we use the notation $\text{addr}^{(d)}$ to denote the address $\text{addr}$, written in binary format, truncated to the most significant $d$ bits.

- **Fetch phase (straightforward):** To fetch a block at some logical address $\text{addr}$, the client looks up logical address $\text{addr}^{(d)}$ in each $\text{ORAM}_d$ for $d = 0, 1, \ldots, D$ sequentially. Since the block at logical address $\text{addr}^{(d)}$ in $\text{ORAM}_d$ stores the position labels for the two blocks at logical addresses $\text{addr}^{(d)} || 0$ and $\text{addr}^{(d)} || 1$ in $\text{ORAM}_{d+1}$, the client is always able to find out the position of the block in the next recursion depth before it performs a lookup there.

- **Maintain phase (coordinated rebuild):** The maintain phase needs special treatment such that the rebuilds at all recursion depths are coordinated. Specifically, whenever the data $\text{ORAM}_D$ is rebuilding the level $\ell$, each other recursion depth $\text{ORAM}_d$ would be rebuilding level $\min(\ell, d)$ in a coordinated fashion — note that each $\text{ORAM}_d$ has only $d$ levels.

The main goal of the coordination is for each $\text{ORAM}_d$ to pass the blocks’ updated position labels back to the parent depth $\text{ORAM}_{d-1}$. More specifically, recall that when $\text{ORAM}_d$ rebuilds a level $\ell$, all real blocks in the level would now be placed at a new random position. When these new positions have been decided, $\text{ORAM}_d$ must inform the corresponding metadata blocks in $\text{ORAM}_{d-1}$ the new position labels. The coordinated rebuild is possible due to the following invariant which is not hard to observe (recall that $\text{addr}^{(d)}$ is the block that stores the position labels for the block $\text{addr}^{(d+1)}$ in $\text{ORAM}_{d+1}$):

For every $\text{addr}$, the block at address $\text{addr}^{(d)}$ in $\text{ORAM}_d$ is always stored at a smaller or equal level relative to the level of the block at address $\text{addr}^{(d+1)}$ in $\text{ORAM}_{d+1}$.

Chan et al. show how to use oblivious sorting to perform a coordinated rebuild, paying $O(n \log n)$ to pass the new position labels of level-$\ell$ in $\text{ORAM}_d$ to the parent $\text{ORAM}_{d-1}$ where $n = 2^\ell$ is the level’s capacity.

### 2.1.3 Analysis

It is not hard to see that the entire fetch phase consumes $O(\log^2 N)$ overhead where one $\log N$ factor comes from the number of levels within each recursion depth, and another comes from the number of recursion depths. The maintain phase, on the other hand, consumes $O(\log^3 N)$ amortized cost where one logarithmic factor arises from the number of recursion depths, one arises from the number of levels within each depth, and the last one stems from the additional logarithmic factor in oblivious sorting.

To asymptotically improve the overhead, one promising idea is to somehow balance the fetch and maintain phases. This idea has been explored in computationally secure ORAMs first by Kushilevitz et al. [KLO12] and later improved in subsequent works [CGLS17]. Unfortunately as we explain in Appendix A, Kushilevitz et al.’s rebalancing trick is not compatible with known perfect ORAMs. Thus we need fundamentally new techniques for realizing such a rebalancing idea.

### 2.2 Building Blocks

Before we introduce our new algorithms, we describe two important oblivious algorithms as building blocks that were discovered in very recent works [Pes18,AKL+20a].

\footnote{Roughly speaking, it is because each logical access on $\text{ORAM}_{i+1}$ would have incurred too many accesses on $\text{ORAM}_i$, and then the cost of such recursion would have been too expensive.}
Tight compaction. Tight compaction is the following task: given an input array containing \( m \) balls where each ball is tagged with a bit indicating whether it is real or dummy, produce an output array containing also \( m \) balls such that all real balls in the input appear in the front and all dummies appear at the end.

In a very recent work called OptORAMa \([AKL+20a]\), the authors show how to accomplish tight compaction obliviously in \( O(m) \) time. Their algorithm can be expressed as a linear-sized circuit (of constant fan-in and fan-out), consisting only of boolean gates and swap gates, where a boolean gate can perform boolean computations on two input bits; and a swap gate takes in a bit and two balls, and decides to either swap or not swap the two balls.

Intersperse. The same work OptORAMa described another randomized oblivious algorithm called “intersperse”, which accomplishes the following task in deterministic linear time: given two randomly shuffled input arrays \( I \) and \( I' \) (where the permutations used in the shuffles are hidden from the adversary), create an output array of length \( |I| + |I'| \) that contains all elements from the two input arrays, and moreover, all elements in the output array are randomly shuffled in the view of the adversary.

2.3 A New Rebalancing Trick for Perfectly Secure ORAMs

We propose new techniques for instantiating such a rebalancing trick. Our idea is to introduce a notion called a fat-block. A fat-block is a bundle of \( \chi := \log N \) normal blocks; thus to access a fat-block requires paying \( \chi = \log N \) cost.

Imagine that in each metadata ORAM, the atomic unit of storage is a fat-block (rather than a normal block). Since each fat-block can pack \( \chi = \log N \) position labels, the depth of the recursion is now \( \log_\chi N = \log N / \log \log N \), i.e., a \( \log \log N \) factor smaller than before (see Section 2.1.2).

More concretely, a metadata ORAM \( \text{ORAM}_d \) at depth \( d \) stores a total of \( \chi^d \) metadata fat-blocks — for the time being we assume that \( N \) is a power of \( \chi \) for simplicity, and let \( D := \log_\chi N + 1 \) be the number of recursion depths such that the total storage is still \( O(N) \) blocks (our idea can be generalized to the case when \( N \) is not a power of \( \chi \)). Within each \( \text{ORAM}_d \), as before, we have a total of \( d \log \chi + 1 \) levels where each level \( \ell \) can store \( 2^\ell \) fat-blocks.

It is not hard to see that the fetch phase would now incur \( O(\log^3 N / \log \log N) \) cost across all recursion depths — in comparison with before, the extra \( \log N \) factor arises from the cost of reading a fat-block, and the \( \log \log N \) factor saving comes from the \( \log \log N \) saving in recursion depth.

Our hope is that now with the smaller recursion depth, we can accomplish the maintain phase in amortized \( O(\log^2 N / \log \log N) \) cost. Recall that each level \( \ell \) in a metadata \( \text{ORAM}_d \) now contains \( 2^\ell \) fat-blocks. The crux is to rebuild a level containing \( 2^\ell \) fat-blocks in cost that is linear in the level’s total size, that is, \( 2^\ell \cdot \chi \). Note that if we naïvely used oblivious sorting on fat-blocks (like in Section 2.1.1) to accomplish this, the cost would have been \( 2^\ell \cdot \chi \cdot \log(2^\ell) \) which is more expensive than previous scheme and undesirable.

To resolve this challenge, the following two insights are critical:

- **Sparsity:** First, observe that each level in a metadata ORAM is sparsely populated: although the entire level, say, level \( \ell \), has the capacity to store \( 2^\ell \cdot \chi \) position labels, the level is rebuilt after every \( 2^\ell \) requests. Thus in fact only \( 2^\ell \) of these position label entries are populated.

- **Residual randomness:** The second important observation is that the unvisited fat-blocks contained in any level appear in a random order where the randomness of the permutation is hidden from the adversary — note that a similar observation was first made in the PanORAMa work \([PPRY18]\) by Patel et al.
More specifically, suppose that to start with, a level contains \( n \) fat-blocks including some reals and some dummies, and all of these \( n \) fat-blocks have been randomly permuted (where the randomness of the permutation is hidden from the adversary). As the client visits fat-blocks in a level, the adversary learns which blocks are visited. Now, among all the unvisited blocks, there are both real and dummy blocks and all these blocks are equally likely to appear in any order w.r.t. the adversary’s view.

We now explain how to rely on the above insights to rebuild a level containing \( n = 2^\ell \) fat-blocks in \( O(n \cdot \chi) \) time — note that at most half of these fat-blocks are real, and the remaining are dummy. From Section 2.1.2 we learned that to rebuild a level containing \( n \) fat-blocks, it suffices to realize the following functionality obliviously:

a) **Merge.** The first step of the rebuild is to merge consecutively full levels into the next empty level (or the largest level). After this merge step, this new level is marked full and every smaller level is marked empty.

b) **Permute.** After the above merge step, the resulting array containing \( n \) fat-blocks must be randomly permuted (and their positions after the permutation will then be passed to the parent depth next).

c) **Update.** After the permutation step, each real fat-block in the level (at a recursion depth \( d \)) whose logical address is \( \text{addr} \) must receive up to \( \chi \) updated positions from the child recursion depth, i.e., the fat-block at logical address \( \text{addr} \) wants to learn where the fat-blocks at logical addresses \( \text{addr}||0, \text{addr}||1, \ldots, \text{addr}||(\chi - 1) \) newly reside in the child depth \( d + 1 \).

d) **Create dummy metadata.** Finally, create a dummy metadata array to accompany this level: the dummy metadata array containing \( n \) entries where each entry is \( O(\log N) \) bits (note that an entry is a normal block, not a fat-block). This array should store the positions of all dummy fat-blocks contained in the level in a randomly permuted order.

**Realizing “merge + permute”.** We first explain how to accomplish the “merge + permute” steps. For simplicity we focus on explaining the case where consecutive full levels are merged into the next empty level (since it would be fine if the merging into the largest level alone is done naïvely using oblivious sort on all fat-blocks). Here it is important to rely on the residual randomness property mentioned earlier. Suppose the levels to be merged contain \( 1, 2, 4, 8, \ldots, n/2 \) fat-blocks respectively. Recall that in all of these levels to be merged, the unvisited blocks appear in a random order w.r.t. the adversary’s view. Thus, we can simply do \( O(\log n) \) cascading merges using **Intersperse** (see Section 2.2), every time merging two arrays each containing \( 2^i \) fat-blocks into an array containing \( 2^{i+1} \) fat-blocks, and the overall cost is \( O(n) \).

**Realizing “update”.** At this moment, let us not view the level as an array of \( n \) fat-blocks any more, but as an array of \( O(n \cdot \chi) \) position entries. For realizing the “update” step in \( O(n \cdot \chi) \) overhead, the key insight is to exploit the sparsity.

Recall that the problem we need to solve boils down to the following. There is a destination array \( D \) consisting \( O(n \cdot \chi) \) position entries among which \( O(n) \) entries are going to be updated, and we override terminologies “real” and “dummy” (opposed to previously denoted real or dummy fat-blocks) and say the to-be-updated \( O(n) \) entries are real while all remaining entries are dummy. Additionally, there is a source array \( S \) consisting of \( O(n) \) entries (which can be real or dummy). In both the source \( S \) and the destination \( D \), each real entry is of the form \((k, v)\) where \( k \) denotes a key and \( v \) denotes a payload value; further, in each array \( D \) or \( S \), every real entry must have a distinct key. Now, we would like to route each real entry \((k, v) \in S \) to the corresponding entry with the same key in the destination array \( D \).
Exploiting the sparsity in the problem definition, we devise the following algorithm where an important building block is linear-time oblivious tight compaction (see Section 2.2).

First, we rely on oblivious tight compaction to compact the destination array \( D \), resulting in a compacted array \( \tilde{D} \) consisting of only \( O(n) \) entries. Moreover, recall that oblivious tight compaction can be viewed as a circuit consisting of boolean gates and swap gates. When we compact the destination array \( D \), each swap gate remembers the routing decision since later it will be useful to run this circuit in the reverse direction. After the compaction, we can now afford to pay the cost of oblivious sorting. Hence, each entry in the source \( S \) can route itself to each entry in the compacted destination \( \tilde{D} \) — this can be accomplished through a standard technique called oblivious routing [BCP16, CS17], which has a cost of \( O(n \log n) \). Now, by running the aforementioned tight compaction circuit in the reverse direction, we can route each element of the compacted destination \( \tilde{D} \) back into the original destination array \( D \).

It is not difficult to see that the above steps require only \( O(n \cdot (\chi + \log n)) = O(n \log N) \) cost.

**Obliviously create dummy metadata array.** Finally, obliviously creating the dummy metadata array is easy: this can be accomplished by writing down \( O(\log N) \) bits of metadata per fat-block, and then performing a combination of oblivious random permutation and oblivious sort on the resulting metadata array. To get deterministic simulation overhead for our ORAM, we will need to use an oblivious random permutation algorithm with deterministic performance bounds — fortunately, this is known due to Asharov et al. [AKL+20a].

**Summary.** In the above, we took care to make sure that all oblivious building blocks used give deterministic performance bounds. At this moment, we derive a perfect ORAM scheme with \( O(\log^3 N/\log \log N) \) deterministic overhead. This warmup result already improves upon Chan et al. [CNS18] who showed \( O(\log^3 N) \) expected simulation overhead, as well as Raskin [RS19] who showed roughly \( \sqrt{N} \) deterministic simulation overhead.

### 2.4 Parallelizing the Scheme

So far, for simplicity we have focused on the sequential case. To obtain our OPRAM result, we need to make the above scheme parallel. To this end, we will rely on the OPRAM techniques by Chan et al. [CNS18], that is, the fetch phase is still performed sequentially, but the maintain phase is realized using parallel and oblivious sorting, tight compaction, and random permutation. One main challenge here is that we will need a parallel counterpart for the Intersperse algorithm. Note that Asharov et al. [AKL+20a]'s Intersperse algorithm gives deterministic performance bounds and perfect security, but is inherently sequential. We devise a new parallel Intersperse algorithm that preserves the same asymptotical total work as the sequential version of Asharov et al. [AKL+20a] (for fat-blocks); however, the algorithm gives Las-Vegas-type performance. This is fine if we only aim for an \( O(\log^3 / \log \log N) \)-expected-overhead OPRAM, but it will not work if we want matching high probability performance bounds. Observe that our parallel Intersperse algorithm’s performance bounds are more concentrated around the mean for larger input sizes. In our OPRAM, the Intersperse algorithm needs to be applied to input arrays containing \( 1, 2, 4, \ldots, N/\log N \) fat blocks. Therefore, to get \( O(\log^3 / \log \log N) \) overhead with \( 1 - \text{negl}(N) \) probability, our idea is to apply the Las-Vegas Intersperse algorithm only to sufficiently large instances, and for small instances, we apply a variant which gives deterministic performance bounds but is a logarithmic factor more expensive. We prove that this bi-modal approach gives an OPRAM scheme whose simulation overhead is \( O(\log^3 / \log \log N) \) with \( 1 - \text{negl}(N) \) probability.

Finally, to get an OPRAM with deterministic performance bounds, we need to replace the Intersperse building block entirely with one that gives deterministic performance bounds, but is a
logarithmic factor more expensive. This explains why our OPRAM with deterministic performance bounds has an extra logarithmic factor.

2.5 Open Questions

Our paper raises several interesting open questions. First, for constructions with deterministic performance bounds, currently our OPRAM scheme has a logarithmic factor higher simulation overhead than the sequential ORAM — this extra logarithmic factor stems from the parallel perfect oblivious permutation building block we use [AS96]. One open question is whether we can get rid of this extra logarithmic factor.

In our paper, we adopt the standard notion of simulation overhead originally defined by Goldreich and Ostrovsky [GO96,Gol87]. This standard notion is naturally amortized over the multiple steps of the RAM/PRAM, since it takes the ratio of the total runtime of the ORAM/OPRAM and that of the original RAM/PRAM. In comparison, some earlier works consider an even stronger notion, often referred to as worst-case overhead [OS97,SCSL11,KLO12,CGLS17,RS19]: a worst-case overhead of $\chi$ requires that every (parallel) step of the original RAM/PRAM is simulated by at most $\chi$ steps in the compiled ORAM/OPRAM. While some previous ORAM/OPRAM constructions are amenable to a standard deamortization trick [OS97,KLO12] to achieve worst-case overhead that match the amortized, our constructions are not compatible with standard deamortization techniques [KLO12] for a similar reason why PanORAMa [PPRY18] and OptORAMa [AKL+20a] are also not compatible with standard deamortization. It is due to the “residual randomness” technique: after the residual randomness of an element is used to facilitate a random permutation, if the same element is then accessed due to the deamortization, then such residual randomness is revealed and the random permutation is no longer random in the adversarial view, which is insecure. An interesting future direction is whether we can achieve worst-case overhead that match the amortized bounds claimed in our paper.

Our paper focuses on the theoretical understanding of the asymptotic complexity of perfectly secure ORAMs/OPRAMs. Our asymptotic constant is huge due to using AKS sorting network [AKS83] and the linear tight compaction [AKL+20b]. The constants are similar to earlier works [DMN11,CNS18] that also use AKS sorting. A standard way to replace the huge constant with another logarithmic factor is to replace both AKS sorting and tight compaction with bitonic sorter [Bat68] (notice that the standard bitonic sort takes $O(n \log^2 n)$ work, but to sort only 0/1 elements, i.e. tight compaction, a bitonic sort augmented with counting takes only $O(n \log n)$ work). An interesting question is whether we can achieve the performance bounds claimed in this paper, but without the use of expander graphs with large constants.

Last but not the least, for perfectly secure ORAM/OPRAM (and in fact even for statistically secure ORAM/OPRAM), we still do not have matching upper- and lower-bounds. Therefore, a more challenging direction is to close this obvious gap in our understanding.

2.6 Roadmap of Subsequent Formal Sections

In the technical sections, we formalize the blueprint described in this section. Our formal description is modularized which will facilitate formal analysis and proofs. Moreover, in our formal sections we will directly present the OPRAM result (since the sequential ORAM is a special case of the more general OPRAM result).

3 Preliminaries
3.1 Definitions

3.1.1 Parallel Random-Access Machines

We review the concepts of a parallel random-access machine (PRAM) and an oblivious parallel random-access machine (OPRAM). The definitions in this section are borrowed from Chan et al. \[CNS18\]. Although we give definitions only for the parallel case, we point out that this is without loss of generality, since a sequential RAM can be thought of as a special case PRAM with one CPU.

Parallel Random-Access Machine (PRAM) A parallel random-access machine consists of a set of CPUs and a shared memory denoted by $\text{mem}$ indexed by the address space $\{0, 1, \ldots, N - 1\}$, where $N$ is a power of 2. In this paper, we refer to each memory word also as a block, which is at least $w = \Omega(\log N)$ bits long.

We consider a PRAM model where the number of CPUs is fixed to be some parameter $m \leq N^5$. Each CPU has a state that stores $O(1)$ blocks. In each step, each CPU executes a next instruction circuit denoted $\Pi$, and then interacts with memory. Circuit $\Pi$ can perform word-level operations including addition, subtraction, and bit-wise boolean operations in unit time and then updates the CPU state. Further, CPUs interact with memory through request instructions $\vec{I}(t) := (I^t_i : i \in [m])$. Specifically, at time step $t$, CPU $i$’s instruction is of the form $I^t_i := (\text{read}, \text{addr})$, or $I^t_i := (\text{write}, \text{addr}, \text{data})$ where the operation is performed on the memory block with address $\text{addr}$ and the block content $\text{data}$.

Write conflict resolution. By definition, multiple read operations can be executed concurrently with other operations even if they visit the same address. However, if multiple concurrent write operations visit the same address, a conflict resolution rule will be necessary for our PRAM to be well-defined. In this paper, we assume the following:

- The original PRAM supports concurrent reads and concurrent writes (CRCW) with an arbitrary, parametrizable rule for write conflict resolution. In other words, there exists some priority rule to determine which write operation takes effect if there are multiple concurrent writes in some time step $t$.

- Our compiled, oblivious PRAM (defined below) is a “concurrent read, exclusive write” PRAM (CREW). In other words, our OPRAM algorithm must ensure that there are no concurrent writes at any time.

CPU-to-CPU communication. In the remainder of the paper, we sometimes describe our algorithms using CPU-to-CPU communication. For our OPRAM algorithm to be oblivious, the inter-CPU communication pattern must be oblivious too. We stress that such inter-CPU communication can be emulated using shared memory reads and writes. Therefore, when we express our performance metrics, we assume that all inter-CPU communication is implemented with shared memory reads and writes.

\[5\text{If } N < m, \text{ the oblivious simulation can be achieved by assigning at most one address to each CPU and then performing oblivious routing \[BCP16\], which takes only } O(\log m) \text{ overhead.}\]
Additional assumptions and notations. Henceforth, we assume that each CPU can only store $O(1)$ memory blocks. Further, we assume for simplicity that the runtime $T$ of the PRAM is fixed a priori and publicly known. Therefore, we can consider a PRAM to be parametrized by the following tuple

$$\text{PRAM} := (\Pi, N, T, m),$$

where $\Pi$ denotes the next instruction circuit, $N$ denotes the total memory size (in terms of number of blocks), $T$ denotes the PRAM’s total runtime, and $m$ denotes the number of CPUs.

Finally, in this paper, we consider PRAMs that are stateful and can evaluate a sequence of inputs, carrying states in between, where each input can be stored in a single memory block.

3.1.2 Oblivious Parallel Random-Access Machines

An OPRAM is a (randomized) PRAM with certain security properties, i.e., its access patterns leak no information about the inputs to the PRAM.

Randomized PRAM. A randomized PRAM is a PRAM where the CPUs are allowed to generate private random numbers. Concretely, we assume that the next instruction circuit $\Pi$ can sample a uniform random number from $[a]$ for any positive integer $a \leq 2^w$ in unit time (recall that $w$ is the memory word size in bits), where the assumption is needed by the oblivious random permutation (e.g., [AS96]). For simplicity, we assume that a randomized PRAM has a priori known, deterministic runtime, and that the CPU activation pattern in each time step is also fixed a priori and publicly known.

Memory access patterns. Given a PRAM program denoted $\text{PRAM}$ and a sequence $\text{inp}$ of inputs, we define the notation $\text{Addresses}[\text{PRAM}](\text{inp})$ as follows:

- Let $T$ be the total number of parallel steps that $\text{PRAM}$ takes to evaluate inputs $\text{inp}$.
- Let $A_t := (\text{addr}_1^t, \text{addr}_2^t, \ldots, \text{addr}_m^t)$ be the list of addresses such that the $i$-th CPU accesses memory address $\text{addr}_i^t$ in time step $t$.
- We define $\text{Addresses}[\text{PRAM}](\text{inp})$ to be the random object $\{A_t\}_{t \in [T]}$.

Oblivious PRAM (OPRAM). We say that a PRAM is perfectly oblivious, iff for any two input sequences $\text{inp}_0$ and $\text{inp}_1$ of equal length, it holds that the following distributions are identically distributed (where $\equiv$ denotes identically distributed):

$$\text{Addresses}[\text{PRAM}](\text{inp}_0) \equiv \text{Addresses}[\text{PRAM}](\text{inp}_1)$$

We remark that for statistical and computational security, some earlier works [CGLS17, CS17] presented an adaptive, composable security notion. The perfectly oblivious counterpart of their adaptive, composable notion is equivalent to our notion defined above. In particular, our notion implies security against an adaptive adversary who might choose the input sequence $\text{inp}$ adaptively over time after having observed partial access patterns of $\text{PRAM}$.

We say that OPRAM is a perfectly oblivious simulation of PRAM iff OPRAM is perfectly oblivious, and moreover OPRAM($\text{inp}$) is identically distributed as PRAM($\text{inp}$) for any input $\text{inp}$.

Metrics. We will use the standard notion of simulation overhead to characterize an OPRAM’s performance [GO96, Gol87, BCP16]. If a PRAM that consumes $m$ CPUs and completes in $T$ parallel steps can be obliviously simulated by an OPRAM that completes in $\gamma \cdot T$ steps also with $m$ CPUs, then we say that the simulation overhead is $\gamma$. Moreover, supposing that the OPRAM is randomized (by a random tape that is independent from the PRAM), and letting the OPRAM completes in $T'$
steps with $m$ CPUs where $T'$ is a random variable, we say that the expected simulation overhead is $\gamma$ if $\mathbb{E}[T'] = \gamma T$, and we say that the simulation overhead is $\gamma$ with probability $1-\delta$ if $\Pr[T' \leq \gamma T] \geq 1-\delta$. We additionally say the simulation overhead $\gamma$ is deterministic if it is $\gamma$ with probability 1, which coincides with the standard simulation overhead.

More generally, suppose that an ample (i.e., unbounded) number of CPUs are available: in this case if algorithm can be completed in $T$ parallel steps consuming $m_1, m_2, \ldots, m_T$ CPUs in each step respectively, then we say that the algorithm can be completed in $T$ depth and $W := \sum_{t \in [T]} m_t$ total work. Similar to that of simulation overhead, when the total work and depth are random variables, we quantify the total work and depth using expected, with probability, or deterministic, where deterministic is sometimes omitted.

Therefore, for an OPRAM, if the original PRAM (taking $T$ parallel steps and using $m$ CPUs) can be obliviously simulated in $W'$ total work and $T' = O(W'/m)$ depth then the OPRAM has simulation overhead $W'/Tm$.

**Oblivious simulation of a non-reactive functionality.** For defining the security of intermediate building blocks, we now define what it means to obliviously realize a non-reactive functionality. Let $\mathcal{F} : \{0,1\}^* \rightarrow \{0,1\}^*$ be a possibly randomized functionality. We say that $M_\mathcal{F}$ is a perfect oblivious simulation (or oblivious simulation for short) of $\mathcal{F}$ with leakage $\mathcal{L}$, iff there exists a simulator $\text{Sim}$, such that for every input $x \in \{0,1\}^*$, the following real-world and ideal-world distributions are identical:

- **Real world:** execute $M_\mathcal{F}(x)$ and let $y$ be the output and $\text{Addr}$ be the memory access patterns; output $(y, \text{Addr})$;
- **Ideal world:** output $(\mathcal{F}(x), \text{Sim}(\mathcal{L}(x)))$.

For simplicity, if the leakage function $\mathcal{L}(x) = |x|$, we often say that $M_\mathcal{F}$ is a perfect oblivious simulation of $\mathcal{F}$ (omitting the leakage function) for short.

**Modeling input assumptions.** Some of our building blocks provide perfect obliviousness only if the input array is randomly shuffled and the corresponding randomness concealed. More formally, suppose that a machine $M(A,x)$ and a functionality $\mathcal{F}(A,x)$ both take in an array $A \in D^n$ where $D \in \{0,1\}^\ell$ as input and possibly an additional input $x \in \{0,1\}^*$. Formally, we say that “the machine $M$ is a perfectly oblivious simulation of the functionality $\mathcal{F}$ with leakage $\mathcal{L}$ assuming that the input array $A$ is randomly shuffled”, iff for every $A \in D^n$ and every $x \in \{0,1\}^*$, the following real-world and ideal-world distributions are identical:

- **Real world:** randomly shuffle the array $A$ and obtain $A'$, execute $M_\mathcal{F}(A',x)$ and let $y$ be the output and $\text{Addr}$ be the memory access patterns; output $(y, \text{Addr})$;
- **Ideal world:** output $(\mathcal{F}(A,x), \text{Sim}(\ell, \mathcal{L}(A,x)))$.

Note that the above definition considers only a single input array $A$, but there is a natural generalization for algorithms that take two or more input arrays — in this case we may require that some or all of these input arrays be randomly shuffled to achieve perfect obliviousness.

### 3.2 Oblivious Algorithm Building Blocks

We describe some algorithmic building blocks. Unless otherwise noted, for algorithms that operate on arrays of $n$ elements, we always assume that a single memory word is wide enough to store the index of each element within the array, i.e., $w \geq \log n$ where $w$ is the bit-width of each PRAM word. We typically use the following notation: let $B$ denote the bit-width of each element, and let $\beta := \lfloor B/w \rfloor$ denote the number of memory words it takes to store each element.
3.2.1 Oblivious Sort

Oblivious sorting can be accomplished through a sorting network such as the famous construction by Ajtai, Komlós, and Szemerédi [AKS83]. We restate this result in the context of PRAM algorithms:

**Theorem 3.1 (Oblivious sorting [AKS83]).** There exists a deterministic, oblivious algorithm that sorts an array of $n$ elements consuming $O(\beta \cdot n \log n)$ total work and $O(\log n)$ depth where $\beta \geq 1$ denotes the number of memory words it takes to represent each element.

3.2.2 Oblivious Random Permutation

Let ORP be an algorithm that upon receiving an input array $X$, outputs a permutation of $X$. Let $F_{\text{perm}}$ denote an ideal functionality that upon receiving the input array $X$, outputs a perfectly random permutation of $X$. We say that ORP is a perfectly oblivious random permutation, iff it is a perfect oblivious simulation of the functionality $F_{\text{perm}}$. Recall that for any integer $m \in [n]$, each CPU of the PRAM can sample an integer uniformly at random from $[m]$ in unit time.

**Sequential ORP algorithm with deterministic performance bounds.** Recently, a sequential oblivious algorithm is developed to perform such permutation in $O(n \log n)$ total work [AKL+20a, Section 6.4].

**Theorem 3.2 (A sequential ORP algorithm [AKL+20a]).** Let $\beta \geq 1$ denote the number of memory words it takes to represent each element. There exists an oblivious random permutation construction that completes in deterministic $O(\beta \cdot n \log n)$ total work.

**Parallel ORP algorithm with deterministic performance bounds.** Alonso and Schott [AS96] construct a parallel random permutation algorithm that takes $O(n \log^2 n)$ total work and $O(\log^2 n)$ depth to randomly permute $n$ elements. Although achieving obliviousness was not a goal of their paper, it turns out that their algorithm is also perfectly oblivious, giving rise to the following theorem:

**Theorem 3.3 (Alonso-Schott ORP).** There is a perfectly oblivious algorithm that permutes an array of $n$ elements in deterministic $O(\beta \cdot n \log n + n \log^2 n)$ total work and $O(\log^2 n)$ depth where $\beta \geq 1$ denotes the number of memory words for representing each element.

**Parallel, Las Vegas ORP algorithm.** A few recent works [CCS17, AKL+20a] describe another perfectly oblivious random permutation algorithm which is asymptotically more efficient but the algorithm is Las Vegas, i.e., the algorithm satisfies perfect obliviousness and correctness, but with a small probability the algorithm may run longer than the stated bound. Below, we restate this result in the form that we desire in this paper — the specific theorem stated below arises from the improved analysis of Asharov et al. [AKL+20a, Theorem 4.3] where we replace the “quadratic oblivious random permutation” with Alonso-Schott ORP; for the performance bounds, we state an expected version and a high-probability version. Notice that the replaced ORP incurs an $O(\log^2 n)$ depth with probability $o(1)$ but not in expectation.

**Theorem 3.4 (A Las Vegas ORP algorithm).** Let $\beta \geq 1$ denote the number of memory words it takes to represent each element. There exists a Las Vegas perfectly oblivious random permutation construction that completes in expected $O(\beta \cdot n \log n)$ total work and expected $O(\log n)$ depth. Furthermore, except with $n^{-\Omega(\sqrt{n})}$ probability, the algorithm completes in $O(\beta \cdot n \log n)$ total work and $O(\log^2 n)$ depth.

---

6 Using more depth but only unbiased random bits, Czumaj [Czu15] shows a Las Vegas switching network to achieve the same abstraction.
Note that the above theorem gives a high-probability performance bound for sufficiently large $n$. Later in our OPRAM construction, we will adopt ORP for problems of different sizes — we will use Theorem 3.4 for sufficiently large instances and use Theorem 3.3 for small instances.

3.2.3 Oblivious Routing

Oblivious routing \cite{BCP16} is the following primitive where $n$ source CPUs wish to route data to $n'$ destination CPUs based on the key.

- **Inputs:** The inputs contain two arrays: 1) a source array $\text{src} := \{(k_i, v_i)\}_{i \in [n]}$ where each element is a (key, value) pair or a dummy element denoted $(\bot, \bot)$; and 2) a destination array $\text{dst} := \{k'_i\}_{i \in [n']}$ containing a list of (possibly dummy) keys.

We assume that each (non-dummy) key appears no more than $C$ times in the $\text{src}$ array where $C = O(1)$ is a known constant; however, each (non-dummy) key can appear any number of times in $\text{dst}$.

- **Outputs:** We would like to output an array $\text{Out} := \{v'_{i,j}\}_{i \in [n'], j \in [C]}$ where $(v'_{i,1}, \ldots, v'_{i,C})$ contains all the values contained in $\text{src}$ whose keys match $k'_i$ (padded with $\bot$ to length $C$).

**Theorem 3.5** (Oblivious routing \cite{BCP16,CS17,CCS17}). There exists a perfectly oblivious routing algorithm that accomplishes the above task in $O(\log(n + n'))$ depth and $O(\beta \cdot (n + n') \log(n + n'))$ total work where $\beta \geq 1$ denotes the number of words it takes to represent each element.

3.2.4 Oblivious Tight Compaction

As mentioned in Section 2.2, tight compaction is the following task: given an input array containing $n$ elements where each element is tagged with a bit indicating whether it is real or dummy, produce an output array containing also $n$ elements such that all real elements in the input appear in the front and all dummies appear at the end. We will use the parallel oblivious tight compaction of Asharov et al. \cite{AKL20b} running in linear work and logarithmic depth.

**Theorem 3.6** (Oblivious tight compaction \cite{AKL20b}). There exists a deterministic, oblivious tight compaction algorithm that compacts an array of $n$ elements in total work $O(\beta \cdot n)$ and depth $O(\log n)$ where $\beta \geq 1$ denotes the number of words it takes to represent each element.

We point out that Asharov et al.’s oblivious parallel compaction algorithm \cite{AKL20b} works in the so-called *indivisibility* model, that is, the payload of the elements are moved around as opaque strings.

3.3 Parallel Intersperse

Oblivious intersperse is an abstraction that which can be used to mix two input arrays such that the mixing is uniformly at random in the adversarial view. The abstraction was originated in PanORAMa \cite{PPRY18} and then formally defined and realized in OptORAMa \cite{AKL20a}. In this section, we define intersperse for completeness and then state the sequential and parallel realizations that run in expected, high probability, or deterministic performance bounds.
3.3.1 Definition

Informally, in the definition of OptORAMa, the **Intersperse** algorithm receives the concatenation of the two input arrays and only the sum of their lengths is public but not each array’s individual length where each input array is shuffled uniformly at random, and then **Intersperse** is required to output a uniformly shuffled array consisting of all input elements. More specifically, **Intersperse** has the following syntax.

**Input.** The concatenated array $I_0 \| I_1$, and two integers $n_0 := |I_0|$ and $n_1 := |I_1|$.

**Output.** An array $B$ of size $n = n_0 + n_1$ that contains all elements of $I_0$ and $I_1$. Each position in $B$ will hold an element from either $I_0$ or $I_1$, chosen uniformly at random and the choices are concealed from the adversary.

We now define the security notion required for **Intersperse**. We require that when we run **Intersperse** on two input arrays $I_0$ and $I_1$ that are both randomly shuffled (based on a secret permutation), the resulting array will be randomly shuffled (based on a secret permutation) too. More formally stated, we require that **Intersperse** is a perfect oblivious simulation of the following $F_{\text{shuffle}}(I_0, I_1)$ functionality provided that the two input arrays are randomly shuffled. Henceforth we assume that the bit-width of each element in the input arrays is a publicly known parameter that the scheme is implicitly parametrized with.

$F_{\text{shuffle}}(I_0 \| I_1, n_0, n_1)$:

1. Choose a permutation $\pi : [n] \to [n]$ uniformly at random where $n := |I_0| + |I_1|$.
2. Let $I$ be the concatenation of $I_0 \| I_1$.
3. Initialize an array $B$ of size $n$. Assign $B[i] := I[\pi(i)]$ for every $i = 1, \ldots, n$.
4. **Output:** The array $B$.

The recent work OptORAMa [AKL+20a, Claim 6.3] showed how to construct an **Intersperse** algorithm in linear time, i.e., $O(n)$; however, their algorithm is inherently sequential (see the following warmup). A manuscript by Asharov et al. [AKL+20c] considered how to devise a parallel version of **Intersperse** in an attempt to make OptORAMa parallel; but their parallel **Intersperse** algorithm achieves only statistical security. Later in this section we will describe a variant of the parallel intersperse that is perfectly secure but consumes more cost.

3.3.2 Warmup: A Sequential, Linear-Work Intersperse Algorithm

Asharov et al. [AKL+20a] used the following method to construct a sequential **Intersperse** algorithm:

1. First, initialize an array $Aux$ of size $n$ that has $n_0$ zeros and $n_1$ ones, where the zeros’ positions are chosen uniformly at random (and the remaining positions are ones). More formally, the algorithm must obliviously simulate the following $F_{\text{SampleAux}}(n, n_0)$ functionality with leakage $(n, n_0)$.

\footnote{The algorithm of Asharov et al. [AKL+20c] may abort and fail with a negligible probability, and such negligible event reveals some information about the input $(n, n_0)$ so that it is only statistically secure.}
\( F_{\text{SampleAux}}(n, n_0) - \text{Sample Auxiliary Array} \)

- **Input:** Two numbers \( n, n_0 \in \mathbb{N} \) such that \( n_0 \leq n \).
- **The functionality:** Sample an array \( \text{Aux} \) of \( n \) bits uniformly at random conditioned on having \( n_0 \) zeros and \( n - n_0 \) ones. Output \( \text{Aux} \).

2. Next, we route elements 1-to-1 from \( I_0 \) to zeros in \( \text{Aux} \) and 1-to-1 route elements from \( I_1 \) to ones in \( \text{Aux} \). This can be accomplished by running oblivious tight compaction circuit (Theorem 3.6) to pack all the 0s in \( \text{Aux} \) in the front. During the process, all swap gates remember their routing decisions. Now, we can run the oblivious tight compaction circuit in reverse and on the input array \( I_0||I_1 \). It is not hard to see that in the outcome, every 0 position in \( \text{Aux} \) would receive an element from \( I_0 \) and every 1 position in \( \text{Aux} \) would receive an element from \( I_1 \).

Asharov et al. [AKL+20a, Claim 6.3] proved that the above algorithm indeed realizes the \text{Intersperse} abstraction as defined above. Moreover, they show how to implement the above idea obliviously in linear-time, resulting in the following theorem:

**Theorem 3.7 (Sequential, linear-time Intersperse [AKL+20a]).** There exists an algorithm that perfectly obliviously simulates \( F_{\text{shuffle}} \) for two randomly shuffled input arrays. Moreover, the algorithm completes in deterministic \( O(3n) \) total work where \( n \) denotes the sum of the lengths of the two input arrays, and \( \beta \geq 1 \) denotes the number of memory words required to represent each element.

### 3.3.3 Parallel Intersperse Algorithms

We need a parallel version of the \text{Intersperse} algorithm. In Asharov et al. [AKL+20a]’s \text{Intersperse} construction, while the oblivious tight compaction building block can be replaced with a parallel realization of tight compaction (Theorem 3.6), unfortunately they adopt a highly sequential procedure for generating the \( \text{Aux} \) array. To get a parallel algorithm, it suffices to devise a parallel procedure for generating such an \( \text{Aux} \) array. More formally, we would like to devise an algorithm that obliviously simulates the functionality \( F_{\text{SampleAux}}(n, n_0) \).

A \textbf{naïve algorithm with deterministic performance}. A naïve algorithm is the following: simply write down exactly \( n_0 \) number of 0s and \( n - n_0 \) number of 1s, apply an oblivious random permutation to permute the array, and output the result. If we use Theorem 3.3 to instantiate this naïve algorithm, we obtain the following theorem:

**Theorem 3.8 (Naïve parallel algorithm for sampling \( \text{Aux} \)).** For any \( n_0 \leq n \), there exists an algorithm that perfectly obliviously simulates \( F_{\text{SampleAux}}(n, n_0) \); moreover, for sampling an \( \text{Aux} \) array of length \( n \), the algorithm completes in deterministic \( O(n \log^2 n) \) total work and \( O(\log^2 n) \) depth.

This immediately gives rise to the following corollary for \text{Intersperse} due to the result of Asharov et al. [AKL+20a] and parallel tight compaction (Theorem 3.6):

**Corollary 3.9 (Naïve parallel \text{Intersperse}).** There exists an algorithm that perfectly obliviously simulates \( F_{\text{shuffle}} \) for two randomly shuffled input arrays. Moreover, the algorithm completes in deterministic \( O(3n + n \log^2 n) \) total work and \( O(\log^2 n) \) depth where \( n \) denotes the sum of the lengths of the two input arrays, and \( \beta \geq 1 \) denotes the number of memory words required to represent each element.
A more efficient Las Vegas algorithm. To obliviously simulates the functionality $F_{SampleAux}(n, n_0)$ with better performance, we use the Las Vegas version of oblivious random permutation, Theorem 3.4, which gives the following theorem:

**Theorem 3.10** (Las Vegas parallel algorithm for sampling Aux). For any $n_0 \leq n$, there exists a Las Vegas algorithm that perfectly obliviously simulates $F_{SampleAux}(n, n_0)$. Except with probability $n^{-\Omega(\sqrt{n})}$, the algorithm completes in $O(n \log n)$ total work and $O(\log^2 n)$ depth. Furthermore, the above stated performance bounds also apply in expectation.

Now due to the work of Asharov et al. [AKL+20a] and parallel tight compaction (Theorem 3.6), we have the following corollary.

**Corollary 3.11** (Parallel Intersperse). Let $\beta \geq 1$ be the number of words used to represent an element. There is an Intersperse algorithm that is a perfectly oblivious simulation of $F_{shuffle}$ on two randomly shuffled input arrays; moreover, except with $n^{-\Omega(\sqrt{n})}$ probability, the algorithm completes in $O(\beta n + n \log n)$ total work and $O(\log^2 n)$ depth. Moreover, the stated performance bounds also apply in expectation.

### 4 One-Time Memory

We describe an abstract data structure called an oblivious one-time memory (OTM) which will serve as a core building block in our OPRAM construction. Roughly speaking, a one-time memory (OTM) is initialized with a set of elements using a procedure called **Build**. Once initialized, it allows each element stored in it to be looked up at most once using a procedure called **Lookup**. Further, it is assumed that when each lookup request arrives, the request is accompanied by a correct “position label” for the element requested. When the OTM is no longer needed, one can call a **Getall** operation to extract the set of remaining unvisited elements. A similar notion of oblivious OTM was formulated by Chan et al. [CNS18]. Moreover, assuming that each element can be represented with $\chi \geq 1$ words, Chan et al. show how to construct a perfectly oblivious OTM that consumes $O(\chi n \log n)$ total work to initialize an OTM data structure containing $n$ elements; and where each lookup incurs only $O(\chi)$ overhead.

In this section, we construct an oblivious OTM assuming that the input array of elements provided to the OTM at initialization has already been randomly shuffled (and the randomness hidden from the adversary). Our goal is to allow each lookup to be supported with $O(\chi)$ total work as before (as long as a correct position label accompanies each lookup request); however, we would like the initialization procedure to consume only $O(n \cdot (\chi + \log n))$ total work which is asymptotically better than the OTM of Chan et al. when $\chi$ dominates $\log n$. In other words, in our construction, the initialization procedure is allowed to perform only linear work moving and/or copying the fat elements (i.e., a bundle of $\chi$ words), but is additionally allowed $O(n \log n)$ amount of computation on metadata.

#### 4.1 Definition

A parallel oblivious one-time memory supports three operations: 1) **Build**, 2) **Lookup**, and 3) **Getall**. **Build** is called once upfront to create the data structure: it takes in a set of randomly permuted real elements (each tagged with its logical address) and creates a data structure that facilitates lookup. After this data structure is created, a sequence of lookup operations can be performed: each lookup can request a real element identified by its logical address or a dummy address denoted $\perp$ — if the requested element has a real address, we assume that the correct position label is supplied to
indicate where in the data structure the requested element is. Finally, when the data structure is no longer needed, one may call a Getall operation to obtain a list of real elements (tagged with their logical addresses) that have not been looked up yet, mixed with an appropriate number of dummies, and permuted according to a secret random permutation.

We require that our oblivious one-time memory data structure retain obliviousness as long as 1) the sequence of real addresses looked up all exist in the data structure (i.e., it appeared as part of the input to Build), and 2) each real address is looked up at most once.

4.1.1 Formal Definition

A (parallel) one-time memory scheme denoted \( \text{OTM}[^{n,m,t}] \) is parametrized by three parameters: \( n \) denotes the upper bound on the number of real elements; \( m \) is the batch size for lookups; \( t \) is the number of batch lookups supported.

The scheme \( \text{OTM}[^{n,m,t}] \) is comprised of the following possibly randomized, stateful algorithms (Build, Lookup, Getall), to be executed on a Concurrent-Read, Exclusive-Write PRAM — note that since the algorithms are stateful, every invocation will update an implicit data structure in memory. Henceforth we use the terminology key and value in the formal description but in our OPRAM scheme later, a real key will be a logical memory address and its value refers to its content.

- \( U \leftarrow \text{Build}(S) \): The algorithm takes as input an array \( S \) of \( n \) elements, where each element is either a real key-value pair of the form \( (k_i, v_i) \), or dummy denoted \( (\perp, \perp) \); moreover any two real elements in \( S \) must have distinct keys. The algorithm then creates an in-memory data structure to facilitate subsequent lookup requests (not included in the output); moreover it outputs a position-label array \( U \) containing exactly \( n \) key-position pairs each of the form \( (k, \text{pos}) \). Further, every real key in the input \( S \) will appear exactly once in the list \( U \); and the list \( U \) is padded with \( \perp \) to a length of \( n \).

Recall that each value \( v_i \) in the input \( S \) can be “fatter” than its position label \( \text{pos} \) that is included in the output \( U \). Later in our OPRAM scheme (Section 5), this key-position list \( U \) will be propagated back to the parent recursion depth during a coordinated rebuild.

- \( (v_i : i \in [m]) \leftarrow \text{Lookup}( (k_i, \text{pos}_i) : i \in [m]) \): there are \( m \) concurrent Lookup requests in a single batch, where we allow each key \( k_i \) requested to be either real or \( \perp \). If \( k_i \) is a real key, then \( k_i \) must be contained in \( S \) that was input to Build earlier. In other words, Lookup requests are not supposed to ask for real keys that do not exist in the data structure. Moreover, each real \( (k_i, \text{pos}_i) \) pair supplied to Lookup must exist in the \( U \) array returned by the earlier invocation of Build, i.e., \( \text{pos}_i \) must be a correct position label for \( k_i \).

- \( R \leftarrow \text{Getall} \): the Getall algorithm returns an array \( R \) of length \( n \) where each entry is either \( \perp \) or real and of the form \( (k, v) \). The array \( R \) should contain all real elements inserted during Build but have not been looked up yet, mixed with \( \perp \) to a length of \( n \).

Valid request sequence. Our oblivious one-time memory ensures correctness and obliviousness only if the sequence of requests is valid, defined as below. Roughly speaking, a request sequence is

8 Note that we do not explicitly denote the implicit data structure in the output of Build, since the implicit data structure is needed only internally by the current oblivious one-time memory instance. In comparison, \( U \) is explicitly outputted since \( U \) will later on be (externally) needed by the parent recursion depth in our OPRAM construction.

9 We emphasize this is a major difference between this one-time memory scheme and the oblivious hashing abstraction of Chan et al. [CGLS17]; Chan et al.’s abstraction [CGLS17] allows lookup queries to ask for keys that do not exist in the data structure.
valid only if lookups are non-recurrent (i.e., never look for the same real key twice); and moreover the number of batch requests must be exactly the predetermined parameter $t$. More formally, a sequence of operations is valid, iff:

- The sequence begins with a single call to Build upfront; followed by a sequence of $t$ batch Lookup calls, each of which supplies a batch of $m$ keys and the corresponding position labels; and finally the sequence ends with a single call to Getall.

- Also, in all Lookup operations in the sequence, no two real keys requested (either within the same batch or across different batches) are the same.

**Correctness.** Correctness requires that

1. For any valid request sequence, with probability 1, for every Lookup $((k_i, \text{pos}_i) : i \in [m])$ request, if $k_i = \bot$, the $i$-th answer returned must be $\bot$; else if $k_i \neq \bot$, Lookup must return the correct value $v_i$ associated with $k_i$ that was input to the earlier invocation of Build.

2. For any valid request sequence, with probability 1, Getall must return an array $R$ containing every $(k, v)$ pair that was supplied to Build but has not been looked up; moreover the remaining elements in $R$ must all be $\bot$.

**Perfect obliviousness.** For obliviousness, we require that there exists a simulator $\text{Sim}(1^n, 1^m, 1^t)$ that takes in only the length of the input array provided to Build, the number of requests in a concurrent batch, and the total number of batched requests the OTM must support, such that the following holds. For any input array $S$ consisting of $n$ elements, any sequence of batched requests $K := \{k_{i,j} \}_{i \in [t], j \in [m]}$ such that every key queried must appear in $S$ and moreover, every key is looked up at most once in the same batch or across batches, the following real- and ideal-world distributions must be identical:

- **Real-world.** Consider the following real-world experiment.
  1. Randomly shuffle the input array $S$; and run Build on the outcome;
  2. Make a sequence of $t$ batch Lookup operations defined by $K$, and in every request in any batch, provide the correct position labels as defined by the output $U$ of Build;
  3. Run Getall and let $R$ be the resulting array.
  4. The real-world distribution is defined by the tuple $(\text{Addresses}, R)$ where $\text{Addresses}$ is the access patterns incurred by the OTM in the above experiment.

- **Ideal world.** The ideal-world experiment outputs the following joint distribution:

$$\left(\text{Sim}(1^n, 1^m, 1^t), \mathcal{F}_{\text{getall}}(S, K)\right)$$

where $\mathcal{F}_{\text{getall}}(S, K)$ is the ideal functionality that performs the following: mark every entry in $S$ whose key is contained in $K$ as dummy, randomly shuffle the resulting array and output it.
4.2 Construction

4.2.1 Intuition

We would like achieve the following obliviously. When Build receives an input array of elements, we want to create 1) an array $A$ containing all real elements in the input and an appropriate number of dummies such that all elements are randomly shuffled; and 2) a dummy metadata array denoted $\text{dummy}$ that contains a randomly permuted list of the locations of dummy elements in $A$. When a batch of $m$ Lookup requests arrive, each of the $m$ requests is either a real request tagged with the desired element’s correct position in $A$; or it is a dummy request denoted as $\bot$. Imagine that there are $m$ CPUs, i.e., one for serving each of the $m$ requests. The $m$ CPUs find the next $m$ unvisited dummy positions from the array $\text{dummy}$, denoted $\text{dpos}_1, \ldots, \text{dpos}_m$. For each CPU $i \in [m]$, if it received a real request, it fetches the element from the specified position (that accompanies the request); otherwise it fetches a dummy element from position $\text{dpos}_i$. Finally, Getall simply removes all the visited locations from $A$ and returns the remaining unvisited elements — it is not hard to see that in the array returned by Getall, all elements are randomly shuffled. We stress that the number of dummy elements in the array $A$ must be sufficient to support the number of lookup queries to the OTM.

The main challenge is how to realize the Build procedure obliviously consuming only linear total work on the (possibly fat) elements but allowing $O(n \log n)$ total work on metadata. Here we exploit the fact that the input array has already been randomly shuffled and the randomness hidden from the adversary. Therefore, we only have to pad the input array with an appropriate number of dummies and intersperse this concatenated array. For building the dummy metadata array $\text{dummy}$, we only have to deal with metadata, thus we can rely on standard oblivious sorting techniques.

4.2.2 Detailed Construction

Build aims to create an in-memory data structure consisting of the following:

1. An array $A$ of length $n + \tilde{n}$, where $\tilde{n} := tm$ denotes the number of added dummies and $n$ denotes the number of real elements. Each entry of the array $A$ (real or dummy alike) contains a key-value pair $(\text{key}, \text{val})$ (where $\text{val}$ can be of large size).

2. An array $\text{dummy}$ of $\tilde{n}$ indices that indicate the positions of the added dummies within $A$, and a counter $\text{count}$ that keeps track of how many elements have been looked up so far.

These in-memory data structures, $(A, \text{dummy}, \text{count})$, will then be updated during Lookup.

Build Algorithm Build $((k_i, v_i) : i \in [n])$ proceeds as follows.

1. Initialize. In parallel, construct an array $A_1$ of length $n$ are copied from the input, an array $A_0$ of length $\tilde{n}$ with entries set to $(\bot, \bot)$.

2. Permute real and dummy elements. Perform Parallel Intersperse (Section 3.3) on the arrays $A_0, A_1$ by interleaving the $n$ elements from $A_1$ with the $\tilde{n}$ elements from $A_0$. The resulting permuted array is the $A$ in the data structure.

3. Construct the key-position map $U$. The map $U$ is constructed in the following steps.

   a) Let $M$ be a metadata array of length $n + \tilde{n}$, where the entries of $M$ are of the form $(\text{key}, \text{pos})$, and $\text{pos} \in [1..n + \tilde{n}]$ will index a position within the array $A$. For each $i \in [n + \tilde{n}]$ in parallel, set $M[i].\text{key} := A[i].\text{key}$ and $M[i].\text{pos} = i$. 

23
b) Oblivious sort the array $M$ on the keys to produce an array $\hat{M}$; we use the convention that the extra dummy keys $\perp$’s are at the end.

c) We construct the key-position map $U$ from the first $n$ entries of $\hat{M}$ — recall that each entry of $U$ is of a key-position pair $(k, \text{pos})$.

4. **Construct the dummy indices.** For each $i \in [1..\tilde{n}]$, we denote $\hat{M}_n[i] := \hat{M}[n + i]$. Perform a perfectly oblivious random permutation (ORP, Section 3.2.2) on $\hat{M}_n[1..\tilde{n}]$ (which contain only metadata). We then construct the array of dummy indices: for $i \in [1..\tilde{n}]$ in parallel, we set $\text{dummy}[i] := \hat{M}_n[i].\text{pos}$.

We initialize the counter $\text{count} := 0$.

At this moment, the data structure $(A, \text{dummy}, \text{count})$ is stored in memory. The key-position map $U$ is explicitly output and later in our OPRAM scheme it will be passed to the parent recursion depth during coordinated rebuild.

If we instantiate Intersperse using the algorithm corresponding to Corollary 3.11 and instantiate the oblivious random permutation using the algorithm corresponding to Theorem 3.4, we obtain the following fact — for simplicity, throughout Sections 4 and 5, we will focus on the expected performance. Later in Section 6 we will describe how to obtain high-probability performance bounds (where we will need to instantiate small instances with non-Las-Vegas algorithms with deterministic performance bounds).

**Fact 1.** The Build algorithm completes in $O((n + \tilde{n}) \cdot (\chi + \log(n + \tilde{n})))$ total work and $O(\log^2(n + \tilde{n}))$ depth in expectation.

As mentioned before, when the elements can be “fat” and the metadata is “thin”, our Build is asymptotically more efficient than that of Chan et al. CNS18. We now prove the above fact.

**Proof.** The depth is dominated by Intersperse, which follows by Corollary 3.11. The total work on elements is dominated by the Intersperse procedure on $(A_0, A_1)$, which is $O((n + \tilde{n}) \cdot (\chi + \log(n + \tilde{n})))$ by Corollary 3.11. On metadata, the dominating subroutines are the oblivious sort (realized with the AKS sorting network AKSS3) on $M$ and the oblivious random permutation on $\hat{M}$ (Theorem 3.4). Hence, the summed total work is $O((n + \tilde{n}) \cdot (\chi + \log(n + \tilde{n})))$. \qed

**Lookup** We implement a batch of $m$ concurrent lookup operations $\text{Lookup}((k_i, \text{pos}_i) : i \in [m])$ as follows. For each $i \in [m]$, we perform the following in parallel.

1. **Decide position to fetch from.** If $k_i \neq \perp$ is real, set $\text{pos} := \text{pos}_i$, i.e., we want to use the position label supplied from the input. Else if $k_i = \perp$, set $\text{pos} := \text{dummy}[\text{count} + i]$, i.e., the position to fetch from is the next indexed dummy. (To ensure obliviousness, the algorithm can always pretend to execute both branches of the if-statement.)

At this moment, $\text{pos}$ is the position to fetch from (for the $i$-th request out of $m$ concurrent requests).

2. **Read and remove.** Read value from $A[\text{pos}]$, mark $A[\text{pos}] := \perp$.

3. **Update counter.** The counter is only updated once per batch request: $\text{count} := \text{count} + m$.

4. **Return.** Return the value read in the above Step 2.

The following fact is straightforward from the algorithm.
Fact 2. The Lookup algorithm runs in \(O(m\chi)\) total work and \(O(1)\) depth.

Getall By always having exactly \(t\) batch requests, there are exactly \(\tilde{n}\) entries in \(A\) have been accessed during previous Lookup operations. Our goal is to remove these accessed entries and output a list of remaining unvisited entries. Note that the algorithm need not hide which entries have been accessed since this information has already been observed by the adversary.

It is not hard to see that we can accomplish this removal in \(O((n + \tilde{n}) \cdot \chi)\) total work and \(O(\log(n + \tilde{n}))\) depth. Basically, the algorithm boils down to an all-prefix-sum calculation: suppose we write down \(b_i := 0\) if the \(i\)-th element has been visited and write down \(b_i := 1\) otherwise. Let \(s_i := \sum_{j=[i]} b_i\) denote the prefix sum up to index \(i\). We will then assign \(i\)-th CPU to grab the \(i\)-th element and if it is unvisited, the CPU places it at index \(s_i\) in the final output array.

To compute all prefix sums in parallel, we can rely on a binary tree — without loss of generality, assume that \(n + \tilde{n}\) is a power of 2.

1. Consider a binary tree with \(n + \tilde{n}\) leaves where the \(i\)-th leaf is tagged with the bit \(b_i\). Every node in the tree wants to compute two sums: 1) a subtree sum that sums up all leaves in its own subtree; and 2) a prefix sum defined as the sum of the entire prefix up to the rightmost child in its subtree.

2. First, compute the subtree sums of all nodes in \(O(n + \tilde{n})\) total work and \(O(\log(n + \tilde{n}))\) depth using the most natural algorithm: from the leaf level to the root, every node sums up the subtree sums of its two children.

3. Next, compute all nodes’ prefix sums in the following fashion. First, the prefix sum of the root is the same as its subtree sum. Now, a node in the tree can calculate its own prefix sum as long as its parent has calculated its prefix sum:
   
   - If the node is the left child of some parent, its prefix sum is its parent’s prefix sum minus its sibling’s subtree sum;
   - If the node is the right child of some parent, simply copy the parent’s prefix sum.

It is not hard to see that when executed in parallel, the above algorithm completes in \(O(n + \tilde{n})\) total work and \(O(\log(n + \tilde{n}))\) depth. We thus have the following fact.

Fact 3. The Getall algorithm runs in \(O((n + \tilde{n}) \cdot \chi)\) total work and \(O(\log(n + \tilde{n}))\) depth.

Lemma 4.1 (Perfect obliviousness of the one-time memory scheme). The above (parallel) one-time memory scheme satisfies perfect obliviousness.

Proof. It suffices to prove that for any \(S = ((k_i, v_i) : i \in [n])\) and \(K \subseteq \{k_i\}_{i \in [n]}\), the real-world distribution of \((\text{Accesses}, R)\) is identical to the ideal-world \((\text{Sim}(1^n, 1^m, 1^t), F_{\text{Getall}}(S, K))\). We proceed by defining \(\text{Sim}\), then we show the real-world \(\text{Accesses}\) is identical to \(\text{Sim}\) and that the marginal distribution of \(R\) is identical to \(F_{\text{Getall}}(S, K)\).

First, almost all parts of Build are deterministic and data oblivious and thus the algorithm’s access patterns can be simulated in the most straightforward fashion. The only randomized part of access patterns for Build is due to the oblivious random permutation. To simulate this part, the simulator calls the oblivious random permutation’s simulator.

Second, to simulate the access patterns of Lookup, for every \(i \in [m]\), the simulator would read the memory location storing \(\text{count}\) and then read the dummy index \(\text{dummy}[\text{count} + i]\). Then, it reads a random unread index of the array \(A\) and writes to it once too. Finally, it writes to \(\text{count}\) for every \(i \in [m]\).
Third, simulating the access patterns of \texttt{Getall} is done in the most natural manner since the access pattern of \texttt{Getall} is a deterministic function of the access pattern of the second step, \texttt{Lookup}.

Observe the list \(S\) is randomly permuted upfront (before \texttt{Build}) in the real-world and the added dummies (\(\bot, \bot\)) are also randomly permuted (as dummies differ only in the metadata \(\hat{M}\)). Then, in the array \(A\) generated by \texttt{Build}, every real and dummy element will be in a random location by \texttt{Intersperse} (Corollary 3.11). With a valid request sequence, the real-world algorithm \texttt{Lookup} accesses each real or dummy element at most once, and thus every real-world access visits a random position of the array \(A\) (besides reading and writing dummy and count). Hence, the marginal distribution of \texttt{Accesses} is identical to the output of \texttt{Sim}.

For the marginal variable \(R\) of real-world experiment, we use again that, in the array \(A\) generated by \texttt{Build}, every real and dummy element is in a random location. Conditioning on any fixed access pattern in the real world, the unvisited locations holds still a random unvisited real element or a random unvisited dummy (\(\bot, \bot\)). As \(R\) consists of all the unvisited real and dummy elements in the sequential ordering, it is identical to the ideal output \(F_{\text{getall}}(S, K)\).

Summarizing the above Fact 1, 2, 3, and Lemma 4.1, we conclude with the following theorem.

\begin{theorem}

The above scheme (\texttt{Build}, \texttt{Lookup}, \texttt{Getall}) is a perfectly oblivious (parallel) one-time memory. Assume that each element can be represented as \(\chi\) words, the performance is:

- \texttt{Build}: \(O((n + \tilde{n}) \cdot (\chi + \log(n + \tilde{n})))\) total work and \(O(\log^2(n + \tilde{n}))\) depth in expectation,
- \texttt{Lookup}: \(O(m\chi)\) total work and \(O(1)\) depth, and
- \texttt{Getall}: \(O((n + \tilde{n}) \cdot \chi)\) total work and \(O(\log(n + \tilde{n}))\) depth.

\end{theorem}

\section{Perfect OPRAM with Expected Performance Bound}

In this section we will put together the building blocks constructed earlier and obtain our final OPRAM scheme.

\textbf{Terminology.} Adopting the terminology of earlier works on ORAMs \cite{SvDS13,WC15,SCSL11} and OPRAMs \cite{CS17,BCP16}, we use the term \textit{block} to refer to a word of the PRAM. Recall that the PRAM makes batches of requests where each batch is of size \(m\). Our construction uses \(\chi = \Theta(\log N)\) to denote a “branching factor” which we assume is a power of 2 without loss of generality.

\subsection{Recursive OPRAMs}

Let \(D := \log_{\chi} \frac{N}{m}\). Our OPRAM construction consists of \(D + 1\) position-based OPRAMs (defined and constructed in Section 5.2) henceforth denoted OPRAM\(_0\), OPRAM\(_1\), OPRAM\(_2\), \ldots, OPRAM\(_D\) — we also refer to them as \(D + 1\) recursion depths. Each position-based OPRAM denoted OPRAM\(_d\) consists of \(d\log_{\chi} + 1\) levels geometrically growing (with factor 2) in size, where each level is a one-time oblivious memory scheme as defined and described in Section 4.

For \(d < D\), OPRAM\(_d\) stores \(\Theta(\chi^d \cdot m)\) fat-blocks where each fat-block is a bundle of \(\chi\) normal blocks (i.e., \(\chi\) words). The last recursion depth OPRAM\(_D\) stores the actual data blocks. Henceforth every OPRAM\(_d\) where \(d < D\) is said to be a \textit{metadata OPRAM}; since these OPRAMs jointly store a logical index structure for discovering the position labels of the desired (fat-)blocks in the next recursion depth. The last OPRAM\(_D\) is called the \textit{data OPRAM} since it stores the actual data blocks.
**Format of depth-\(d\) block and address.** Suppose that a block’s logical address is a \(\log_2 N\)-bit string denoted \(\text{addr}^{(D)} := \text{addr}[1..(\log_2 N)]\) (expressed in binary format), where \(\text{addr}[1]\) is the most significant bit. In general, at depth \(d\), an address \(\text{addr}^{(d)}\) is the length-\((\log_2 m + d \log_2 \chi)\) prefix of the full address \(\text{addr}^{(D)}\). Henceforth, we refer to \(\text{addr}^{(d)}\) as a depth-\(d\) address (or the depth-\(d\) truncation of \(\text{addr}\)).

When we look up a data block, we would look up the full address \(\text{addr}^{(D)}\) in recursion depth \(D\); we look up \(\text{addr}^{(D-1)}\) at depth \(D-1\), \(\text{addr}^{(D-2)}\) at depth \(D-2\), and so on. Finally at depth 0, the \(\log_2 m\)-bit address uniquely determines one of the \(m\) fat-blocks stored at \(\text{OPRAM}_0\). Since each batch consists of \(m\) concurrent lookups, one of them will be responsible for this fat-block in \(\text{OPRAM}_0\).

For \(d < D\), a fat-block with the address \(\text{addr}^{(d)}\) in \(\text{OPRAM}_d\) stores the position labels for \(\chi\) (fat-)blocks in \(\text{OPRAM}_{d+1}\), at addresses \(\{\text{addr}^{(d)}||s : s \in \{0,1\}^{\log_2 \chi}\}\). Henceforth, we say that these \(\chi\) addresses are siblings to one another.

### 5.2 Position-Based OPRAM

![Figure 1: OPRAM data structure.](image)

A position-based OPRAM is almost a fully functioning OPRAM except that whenever a batch of memory requests come in, we assume that *each request must be tagged with a correct position label indicating exactly where the requested block is in the OPRAM.* In our subsequent full OPRAM construction, to fetch a data block in \(\text{OPRAM}_D\), we recursively request the block’s position label from \(\text{OPRAM}_{D-1}\) first — once a correct position label is obtained, we may begin accessing \(\text{OPRAM}_D\) for the desired block. In other words, every \(\text{OPRAM}_d\) stores the position labels for the next \(\text{OPRAM}_{d+1}\).

#### 5.2.1 Data Structure

We next describe \(\text{OPRAM}_d\) for some \(1 \leq d \leq D = \log_\chi \frac{N}{m}\). As we shall see, the case \(\text{OPRAM}_0\) is trivial and is treated specially.
Hierarchical levels. The position-based $\text{OPRAM}_d$ consists of $d \log_2 \chi + 1$ levels henceforth denoted as $(\text{OTM}_j : j = 0, \ldots, d \log_2 \chi)$ where level $j$ is a one-time oblivious memory scheme,

$$\text{OTM}_j := \text{OTM}_j[2^j \cdot m, m, 2^j]$$

with at most $n = 2^j \cdot m$ real (fat- or data) blocks, $m$ concurrent lookups in each batch (which can all be real), and $2^j$ batch requests. This means that for every $\text{OPRAM}_d$, the smallest level is capable of storing up to $m$ real fat-blocks. Every subsequent level can store twice as many real (fat- or data) blocks as the previous level. For the largest $\text{OPRAM}_D$, its largest level is capable of storing $N$ real data blocks given that $D = \log_\chi N$ — this means that the total space consumed is $O(N)$. Plugging in the hierarchical levels, the full OPRAM data structure is illustrated in Figure 1.

Every level $j$ is marked as either empty (when the corresponding $\text{OTM}_j$ has not been rebuilt) or full (when $\text{OTM}_j$ is ready and in operation). Initially, all levels are marked as empty, i.e., the OPRAM initially is empty.

Position label. Henceforth we assume that a position label of a block specifies 1) which level the block resides in; and 2) the index within the level the block resides at.

Augmented block. We assume that each fat block or data block is of the form (logical address, payload), i.e., each block carries its own logical address. This will simplify the procedure $\text{MergeLevels}$ later.

5.2.2 Operations

Each position-based OPRAM supports two operations, $\text{Lookup}$ and $\text{Shuffle}$. In the following, we describe the algorithms $\text{Lookup}$ and $\text{Shuffle}$ for every $\text{OPRAM}_d$ where $d \geq 1$, and then we will describe the $\text{Lookup}$ and $\text{Shuffle}$ for the trivial case, $\text{OPRAM}_0$.

$\text{Lookup}$. Every batch lookup operation, denoted $\text{Lookup}((\text{addr}_i, \text{pos}_i) : i \in [m])$ receives as input the logical addresses of $m$ blocks as well as a correct position label for each requested block. To complete the batch lookup request, we perform the following.

1. For each $j = 0, \ldots, d \log_2 \chi$ in parallel, perform the following:
   - For each $i \in [m]$ in parallel, perform the following:
     - If $\text{pos}_i$ indicates that the block should be stored in level $j$, then set $\text{addr}'_i := \text{addr}_i$ and let $\text{pos}'_i := \text{pos}_i$ (and specifically the part of the position label denoting the offset within level $j$); otherwise, set $\text{addr}'_i := \bot$ and $\text{pos}'_i := \bot$.
   - $(v_{ij} : i \in [m]) \leftarrow \text{OTM}_j.\text{Lookup}((\text{addr}'_i, \text{pos}'_i) : i \in [m])$.

2. For each $i \in [m]$ in parallel, perform the following:
   - set $\text{val}_i$ to be the only non-dummy element in $(v_{ij} : j = 0, \ldots, d \log_2 \chi)$, if it exists; otherwise set $\text{val}_i := \bot$. This step can be accomplished using an oblivious select operation in $O(\log d + \log \log \chi)$ depth consuming $d \log_2 \chi$ CPUs.

3. Return $(\text{val}_i : i \in [m])$.

The following follows by Fact 2 the total work of $\text{OTM}.\text{Lookup}$.

**Fact 4.** For $\text{OPRAM}_d$, for each $\text{Lookup}$ containing a batch of $m$ requests, the total work is $O(m \cdot \chi \cdot d \cdot \log_2 \chi)$, and the depth is $O(\log d + \log \log \chi)$. 

28
**Shuffle.** A shuffle operation, denoted Shuffle($U, \ell, A_0$), receives as input an update array $U$ (we will define constraints on $U$ subsequently), the level $\ell$ to be rebuilt, and an array $A_0$ of $m$ fat- or data blocks. For each OPRAM$_d$ it must be guaranteed that $\ell \leq d \log_2 \chi$; moreover, the operation is called only when level $\ell$ is empty or $\ell = d \log_2 \chi$. The Shuffle algorithm is triggered by a previous Lookup instance, which fetches the $m$ fat- or data blocks and then passes these $m$ blocks in the array $A_0$. For the case $d = D$, the contents of these data blocks might possibly be updated.

The Shuffle algorithm then combines levels $0, 1, \ldots, \ell - 1$ into level $\ell$: at the end of the shuffle operation, all levels $0, 1, \ldots, \ell - 1$ are now marked as empty and level $\ell$ is now marked as full.

For $d = D$, the update array $U = \emptyset$; for $d < D$, the update array $U$ must satisfy the following validity requirement. Let $A := A_0 \cup (\bigcup_{i=0}^{\ell} \text{OTM}_i.\text{Getall})$, where the operator $\cup$ denotes union. We shall see that each entry of the update array $U$ contains a pair of depth-$(d + 1)$ address and the corresponding updated position label in OPRAM$_{d+1}$; moreover, if a real depth-$(d + 1)$ address appears in $U$, then its depth-$d$ prefix address must appear in $A$, whose fat-block will need update.

In our full OPRAM scheme later, the update array $U$ will be passed from the immediate larger OPRAM$_{d+1}$, and contains the new position labels that OPRAM$_{d+1}$ has chosen for recently accessed logical addresses.

As we later see, Shuffle actually needs to be performed in parallel across all recursion depths. Hence, Shuffle($U, \ell, A_0$), is actually broken into two phases as follows.

**Merge levels.** The first phase is MergeLevels($\ell, A_0$):

1. **Randomly Interspersing Adjacent Levels.** Apply oblivious random permutation (Theorem 3.4) to the $m$ elements in $A_0$. Then, for $i$ from 1 to $\ell$, perform the following:
   
   $A_i := \text{Intersperse}(A_{i-1}, \text{OTM}_{i-1}.\text{Getall})$.

   At this moment, only $A_\ell$ needs to be kept; the old OTM$_0$, \ldots, OTM$_\ell$ instances and intermediate $A_i$’s (for $i < \ell$) may be destroyed.

2. Let $(\text{OTM}', U') \leftarrow \text{OTM}_\ell.\text{Build}(A_\ell)$ (recall that each block in $A_\ell$ is augmented to carry its own logical address).

3. OTM$'$ is now the new level $\ell$ and henceforth it will be denoted OTM$\ell$. Mark level $\ell$ as full and levels $0, 1, \ldots, \ell - 1$ as empty.

4. Finally, output $U'$ (in our full OPRAM construction later, $U'$ will be passed to the the next (i.e., immediately smaller) position-based OPRAM as the update array for performing its shuffle.

**Update level.** The second phase is UpdateLevel($U, \ell$):

1. **Updating Positions.** For $d = D$, $U$ should be empty and this phase is skipped; for $d < D$, we perform the following.

   Recall that in OTM$\ell$, internally there is an array $A$ of length $2 \cdot m \cdot 2^\ell$, where each non-dummy entry is a fat-block containing $\chi$ positions in OPRAM$_{d+1}$, where each position contains the level number and the index within that level. In parallel, convert each fat-block into $\chi$ sibling entries, each of which is a pair of depth-$(d + 1)$ address together with the corresponding position. We use $M$ to denote the resulting array that contains $2\chi \cdot m \cdot 2^\ell$ such entries.

   2. Next, observe that each entry of $U$ is also a pair of depth-$(d + 1)$ address together with its updated position. However, since $U$ contains at most $m \cdot 2^\ell$ entries, we want to avoid performing oblivious sort on the whole of $M$ and $U$.  

29
The key insight is that the position of an entry in $M$ contains the level number (in OPRAM$_{d+1}$). This position needs to be updated iff its level is at most $\ell$. Hence, each such entry can be marked as special.

3. Using oblivious tight compaction (Section 3.6), we can produce an array $M'$ of size $m \cdot 2^\ell$ that contains all these special entries. While performing oblivious tight compaction, we also record the information (on the server) that later allows us to obliviously reverse the movements.

4. The entries of $M'$ are updated using the positions in $U$ via oblivious routing (Theorem 3.5), which can be implemented by oblivious sorts.

5. Then, using the recorded information (on server), the movements of tight compaction are reversed such that the updated positions in $M'$ get obliviously routed back to the special entries in $M$. Finally, from the updated array $M$, we get back the corresponding updated array $A$, which also forms the updated OTM$_{d}$.

**Fact 5.** For OPRAM$_d$, let $\ell \leq d \log_2 \chi$, then the above two phases of Shuffle($U, \ell, A$) runs in $O(m \cdot 2^\ell \cdot (\chi + \ell + \log m))$ total work and $O(\ell \cdot (\log m + \ell)^2)$ depth in expectation.

**Proof.** From the description of the algorithm, we analyze the two phases.

- The first phase MergeLevels is dominated by Intersperse and OTM$_d$.Build. The $\ell$ instances of Intersperse run on fat-blocks of size $\chi$, which takes $O(m \cdot 2^\ell \cdot (\chi + \ell + \log m))$ total work and $O(\ell \cdot (\log m + \ell)^2)$ depth. By Fact 1, OTM$_d$.Build runs in $O(m \cdot 2^\ell (\chi + \ell + \log m))$ total work and $O((\log m + \ell)^2)$ depth in expectation. Note that if $d = D$, the elements are normal data blocks of size $O(1)$, which is strictly bounded by $\chi$, and hence the total work holds as well.

- The second phase UpdateLevel is not used if $d = D$. For $d < D$, updating the positions from $U$ takes $O(m \cdot 2^\ell \cdot \chi)$ total work on address-position pairs due to tight compaction and $O(m \cdot 2^\ell \cdot (\ell + \log m))$ operations due to oblivious sorting, which sums up to the claimed total work. The depth is dominated by tight compaction, $O(\log m + \ell)$.

Combining the above gives the result. \qed

**Trivial Case:** OPRAM$_0$. OPRAM$_0$ simply stores its entries in an array $A[0..m]$ of size $m$ and we assume that the entries are indexed by a $(\log_2 m)$-bit string. Moreover, each address is also a $(\log_2 m)$-bit string, whose block is stored at the corresponding entry in $A$.

- **Lookup.** Upon receiving a batch of $m$ depth-$m$ truncated addresses where all the real addresses are distinct, use oblivious routing to route $A[0..m]$ to the requested addresses. This can be accomplished in $O(\chi m \log m)$ total work and $O(\log m)$ depth. Note that OPRAM$_0$’s lookup does not receive any position labels.

- **Shuffle.** Since there is only one array $A$ (at level 0), Shuffle($U, 0, A_0$) can be implemented by oblivious sorting, where $U$ contains the updated fat-block contents and $A_0$ is empty for OPRAM$_0$. To elaborate, OPRAM$_0$.MergeLevels shuffles $A$ and outputs $\emptyset$, and OPRAM$_0$.UpdateLevel takes $U$ as input and updates the contents of $A$. It takes $O(\chi m \log m)$ total work and $O(\log m)$ depth.

---

\footnote{Later in Section 10 we will describe how to obtain high-probability performance bounds.}
5.2.3 Analysis of Position-Based OPRAM

**Fact 6.** The construction of position-based OPRAM maintains correctness. More specifically, at every recursion depth \( d \), the correct position labels will be input to the Lookup operations of \( \text{OPRAM}_d \); and every batch of requests will return the correct answers.

*Proof.* Given as input the correct position labels, the Lookup of position-based OPRAM passes the labels to \( \text{OTM}.\text{Lookup} \), and hence the correctness follows.

In our position-based OPRAM construction, for every \( \text{OPRAM}_d \) at recursion depth \( d \), the following invariants are respected by construction as stated in the following facts. For any recursion depth \( d \), denote \( L(d) := d \log_2 \chi \) as the largest level in \( \text{OPRAM}_d \).

**Fact 7.** For every \( \text{OPRAM}_d \), every \( \text{OTM}_\ell \) instance at level \( \ell \leq L(d) \) that is created needs to answer at most \( 2^\ell \) batches of \( m \) requests before \( \text{OTM}_\ell \) instance is destroyed.

*Proof.* For every \( \text{OPRAM}_d \), the following is true: imagine that there is a \( L(d) + 1 \)-bit binary counter initialized to 0 that increments whenever a batch of \( m \) requests come in. Now, for \( 0 \leq \ell < L(d) \), whenever the \( \ell \)-th bit flips from 1 to 0, the \( \ell \)-th level of \( \text{OPRAM}_d \) is destroyed; whenever the \( \ell \)-th bit flips from 0 to 1, the \( \ell \)-th level of \( \text{OPRAM}_d \) is reconstructed. For the largest level \( L(d) \) of \( \text{OPRAM}_d \), whenever the \( L(d) \)-th (most significant) bit of this binary counter flips from 0 to 1 or from 1 to 0, the \( L(d) \)-th level is destroyed and reconstructed. The fact follows in a straightforward manner by observing this binary-counter argument.

**Fact 8.** For every \( \text{OPRAM}_d \) and every \( \text{OTM}_\ell \) instance at level \( \ell \leq L(d) \), during the lifetime of the \( \text{OTM}_\ell \) instance: (a) no two real requests will ask for the same depth-\( d \) address; and (b) for every request that asks for a real depth-\( d \) address, the address must exist in \( \text{OTM}_\ell \).

*Proof.* We first prove claim (a). Observe that for any \( \text{OPRAM}_d \), if some depth-\( d \) address \( \text{addr}^{(d)} \) is fetched from some level \( \ell \leq L(d) \), at this moment, \( \text{addr}^{(d)} \) will either enter a smaller level \( \ell' < \ell \); or some level \( \ell'' \geq \ell \) will be rebuilt and \( \text{addr}^{(d)} \) will go into level \( \ell'' \) — in the latter case, level \( \ell \) will be destroyed prior to the rebuilding of level \( \ell'' \). In either of the above cases, due to correctness of the construction, if \( \text{addr}^{(d)} \) is needed again from \( \text{OPRAM}_d \), a correct position label will be provided for \( \text{addr}^{(d)} \) such that the request will not go to level \( \ell \) (until the level is reconstructed). Finally, claim (b) follows from correctness of the position labels (Fact 6).

5.3 Detailed OPRAM Scheme

5.3.1 Operations

Upon receiving a batch of \( m \) requests denoted as \( \{(\text{op}_i, \text{addr}_i, \text{data}_i) : i \in [m]\} \), we perform the following steps.

1. **Conflict resolution.** For every depth \( d \in \{0, 1, \ldots, D\} \) in parallel, perform oblivious conflict resolution on the depth-\( d \) truncation of all \( m \) addresses requested.

   For \( d = D \), we suppress duplicate addresses. If multiple requests collide on addresses, we would prefer a write request over a read request (since write requests also fetch the old memory value back before overwriting it with a new value). In the case of concurrent write operations to the same address, we use the properties of the underlying PRAM to determine which write operation prevails.
For $0 \leq d < D$, after conflict resolution, the $m$ requests for $\text{OPRAM}_d$ become $((\text{addr}_i^{(d)}, \text{flags}_i) : i \in [m])$, where each non-dummy depth-$d$ truncated address $\text{addr}_i^{(d)}$ is distinct and has a $\chi$-bit flags that indicates whether each of the $\chi$ sibling addresses $\{\text{addr}_i^{(d)} | s : s \in \{0, 1\}^{\log_2 \chi}\}$ is requested in $\text{OPRAM}_{d+1}$.

For completeness, we briefly describe the conflict resolution procedure for $1 \leq d < D$ as follows:

(a) Consider the depth-$(d+1)$ truncated address: $A^{(d+1)} := (\text{addr}_1^{(d+1)}, \ldots, \text{addr}_m^{(d+1)})$, and use oblivious sorting to suppress duplicates of depth-$(d+1)$ addresses, i.e., each repeated depth-$(d+1)$ address is replaced by a dummy. Let $\hat{A}^{(d+1)}$ be the resulting array (of size $m$) sorted by the (unique) depth-$(d+1)$ addresses.

(b) Using $\hat{A}^{(d+1)}$, for each $i \in [1..m]$, we produce an entry $(\text{addr}_i^{(d)}, \text{flags}_i)$ according to the following rules:
   i. If $\text{addr}_i^{(d+1)}$ is a dummy, then $\text{addr}_i^{(d)} := \bot$ and $\text{flags}_i := \bot$ are also dummy.
   ii. Observe that all addresses with the same depth-$d$ prefix are grouped together. Hence, within such a group, we only need to keep the last one (truncated to its depth-$d$ prefix) and add a $\chi$-bit flag to indicate which of the sibling addresses are present. All other addresses are set to dummy.

(c) The batch access for $\text{OPRAM}_d$ is $((\text{addr}_i^{(d)}, \text{flags}_i) : i \in [m])$.

As noted by prior works [BCP16, CLT16, CS17], conflict resolution can be completed by employing oblivious sorting.

2. Fetch. For $d = 0$ to $D$ sequentially, perform the following:

- For each $i \in [m]$ in parallel: let $(\text{addr}_i^{(d)}, \text{flags}_i)$ be the depth-$d$ result of conflict resolution.
- Call $\text{OPRAM}_d$.Lookup to look up the depth-$d$ addresses $\text{addr}_i^{(d)}$ for all $i \in [m]$; observe that position labels $P^{(d)}$ for the lookups of non-dummy addresses will be available from the lookup of the previous $\text{OPRAM}_{d-1}$ for $d \geq 1$, which is described in the next step. Recall that for $\text{OPRAM}_0$, no position labels are needed. We use $A_d$ to denote the $m$ (fat- or data) blocks returned from the lookup of $\text{OPRAM}_d$, and proceed with the following two cases of $d$.
- If $d < D$, each lookup from a non-dummy $(\text{addr}_i^{(d)}, \text{flags}_i)$ will return positions for the $\chi$ sibling addresses $\{\text{addr}_i^{(d)} | s : s \in \{0, 1\}^{\log_2 \chi}\}$. The $\chi$ bits in flags will determine whether each of these $\chi$ position labels will be “needed” later in the lookup of $\text{OPRAM}_{d+1}$.
   At recursion depth $d+1$, there are $m$ CPUs waiting for the position labels corresponding to $\{\text{addr}_i^{(d+1)} : i \in [m]\}$. At depth $d$, there are $\chi$ (real or dummy) labels per CPU. To get the $m$ labels needed at depth $d+1$, run tight compaction on the $\chi \cdot m$ labels such that moves the $m$ needed positions to the front. Now, using oblivious routing (see Theorem 3.5), the $m$ position labels $P^{(d+1)}$ can be delivered to the $m$ CPUs at recursion depth $d+1$.
- If $d = D$, $A_D$ will contain the data blocks fetched. Recall that conflict resolution was used to suppress duplicates. Hence, oblivious routing can be used to deliver each data block to the corresponding CPUs that request it.

3. Maintain. We first consider depth $D$. For every $i \in [m]$ in parallel: set $u_i := (\text{addr}_i^{(D)}, \text{data}_i)$, where $\text{data}_i$ is the updated data block for the address $\text{addr}_i^{(D)}$ (or just the original data block if
it is not modified). Set the array \( A_D := \{ u_i : i \in [m] \} \) and \( U^{(D)} := \emptyset \). Suppose that \( \ell^{(D)} \) is the smallest empty level in \( \text{OPRAM}_D \).

For \( 1 \leq d < D \), recall \( A_d \) denote the \( m \) fat-blocks that are returned from the lookup of \( \text{OPRAM}_d \); for the trivial case \( d = 0 \), \( A_0 := \emptyset \). By the construction of \( \text{OPRAM}_d \), we have the invariant that for all \( 0 \leq d < D \), if \( \ell^{(D)} < d \log_2 \chi \) (recall that \( \text{OPRAM}_d \) consists of \( d \log_2 \chi \) levels), then \( \ell^{(D)} \) is also the smallest empty level in \( \text{OPRAM}_d \).

For \( d := D \) down to 0, each of the following steps is done in parallel across different \( d \)'s:

- If \( d \log_2 \chi < \ell^{(D)} \), set \( \ell_d := d \log_2 \chi \); otherwise, set \( \ell_d := \ell^{(D)} \).
- Call \( U^{(d-1)} \leftarrow \text{OPRAM}_d.\text{MergeLevels}(\ell_d, A_d) \).
- After the previous step, all \( U^{(d)} \)'s are ready. Hence, we call \( \text{OPRAM}_d.\text{UpdateLevel}(U^{(d)}, \ell_d) \).

### 5.3.2 Analysis of OPRAM Scheme

#### Obliviousness

Given Fact 7 and 8, our OPRAM construction maintains perfect obliviousness.

**Lemma 5.1** (Obliviousness). The above OPRAM construction satisfies perfect obliviousness.

**Proof.** For every parallel one-time memory instance constructed during the lifetime of the OPRAM, Facts 7 and 8 are satisfied, and thus every one-time memory instance receives a valid request sequence. Putting together the perfect obliviousness of the parallel one-time memory scheme (Lemma 4.1) and the **Intersperse** (Corollary 3.11), the position-based OPRAM is perfectly oblivious — the output of OTM.\text{Getall} is uniformly random shuffled, then the result of **Intersperse** is uniformly random shuffled, and then the input to the subsequent OTM.\text{Build} is uniformly random shuffled, which implies such OTM is perfectly oblivious. The perfect obliviousness of the position-based OPRAM and then the full OPRAM follows by observing that all other access patterns of the construction are identically distributed and independent of the input requests. \( \square \)

#### Efficiency

We now analyze the asymptotical efficiency of our OPRAM construction. First, observe that the asymptotical performance of the conflict resolution and fetch phases as stated in the following fact.

**Fact 9.** The fetch phase can be completed using \( \mathcal{O}(Dm\chi \log N) = \mathcal{O}\left(\frac{m \log^4 N}{\log \log N}\right) \) total work and \( \mathcal{O}(D : (\log \log N + \log m)) = \mathcal{O}(\log N : (\log \log N + \log m)) \) depth.

**Proof.** The factor \( D \) comes from the number of recursion depths. For each recursion depth, the following costs are incurred: 1) Within each recursion depth, there are \( \mathcal{O}(\log \frac{N}{m}) \) hierarchical levels. Each of the \( m \) requests accesses and computes on one fat- or data block per level (Fact 4). 2) The routing between adjacent depths can be implemented with the AKS sorting network \text{[AKS83]} that moves \( \mathcal{O}(m \log m) \) fat-blocks, which takes \( \mathcal{O}(\chi m \log m) \) total work and \( \mathcal{O}(\log m) \) depth. Hence, the total work is

\[
\mathcal{O}(Dm\chi \log \frac{N}{m}) + \mathcal{O}(D\chi m \log m) = \mathcal{O}(Dm\chi \log N) = \mathcal{O}\left(\frac{m \log^3 N}{\log \log N}\right),
\]

because \( \chi = \Theta(\log N) \). The depth is \( \mathcal{O}(D : (\log \log N + \log m)) \), where \( \mathcal{O}(\log \log N) \) is the depth of each \( \text{OPRAM}_d.\text{Lookup} \) (Fact 4). \( \square \)

We now proceed to analyze the efficiency of the maintain phase.
Fact 10. Let $T$ be the total steps of the original PRAM (where each step contains $m$ memory concurrent requests). Using $m$ CPUs, the maintain phase of OPRAM takes $T \cdot O\left(\frac{\log^3 N}{\log \log N}\right)$ parallel steps in expectation.

Proof. After every $2^\ell$ batch of $m$ requests, for each OPRAM$_d$, the level $\ell$ is reconstructed. Due to Fact 5, each such reconstruction will take $O(m \cdot 2^\ell \cdot (\chi + \ell + \log m))$ total work in expectation for each $d$. Summing from $d = 0$ to $D$, the total work is $O(D \cdot m \cdot 2^\ell \cdot (\chi + \ell + \log m))$. The depth of all $D$ reconstructions is still $O(\ell \cdot (\log m + \ell)^2)$ from Fact 5 because both OPRAM$_d$.MergeLevels and OPRAM$_d$.UpdateLevel($U^{(d)}$, $\ell_d$) are performed in parallel across different $d$'s. Because the depth is less than the total work divided by $m$, the number of parallel steps is just $O(D \cdot 2^\ell \cdot (\chi + \ell + \log m))$ for each $\ell$.

Then, during $T$ batch of requests, for each $\ell$ such that $2^\ell \leq T$, it takes $\left\lceil \frac{T}{2^\ell} \right\rceil \cdot O(D \cdot 2^\ell \cdot (\chi + \ell + \log m)) = T \cdot O(D \cdot (\chi + \ell + \log m))$ parallel steps. Summing over all levels $\ell = 0$ to $D \log_2 \chi$, the total number of parallel steps is

$$T \cdot O(D \cdot ((\chi + \log m)D \log_2 \chi + D^2 \log^2 \chi)) = T \cdot O\left(\frac{\log^3 N}{\log \log N}\right)$$

in expectation, because $\chi = \Theta(\log N)$ and $D = O\left(\frac{\log N}{\log \chi}\right)$.

Theorem 5.2 (Formal statement of Theorem 1.1). The OPRAM construction achieves $O\left(\frac{\log^3 N}{\log \log N}\right)$ simulation overhead in expectation for any sequence of operations.

Proof. Straightforward from Lemma 5.1, Facts 9 and 10.

6 Perfect OPRAM with High-Probability and Deterministic Bounds

So far, we have focused on the expected performance of our OPRAM construction. We can in fact upgrade our performance guarantees to hold with high probability or deterministically (at the cost of logarithmically greater overhead). Observe that in our OPRAM construction, only oblivious random permutation (Section 3.2.2) and intersperse (Section 3.3) may run longer than the expected cost. To obtain an OPRAM that takes a claimed overhead with high probability or deterministically, it suffices to instantiate oblivious random permutation and intersperse that run in deterministic cost.

6.1 Deterministic Performance

Concretely, the OPRAM in Section 5 is modified as follows to achieve deterministic performance.

OPRAM scheme with deterministic performance. The construction is similar to that of Section 5.3.1 and the only difference is below:

1. Implement every oblivious random permutation using the algorithm of Theorem 3.3 (i.e., replacing Theorem 3.4). Recall that oblivious random permutation is used in the following procedure:

   • The Build procedure of one-time memory (Step 4, Section 4.2.2).

2. Implement every parallel intersperse using the algorithm of Corollary 3.9 (i.e., replacing Corollary 3.11). Recall that parallel intersperse is used in the following procedures:
The Build procedure of one-time memory (Step 2, Section 4.2.2).

The MergeLevels procedure of position-based OPRAM (Step 1, Section 5.2.2).

Also, note that the new parallel intersperse algorithm (Corollary 3.9) also uses the new oblivious random permutation (Theorem 3.3) as an internal building block.

For both building blocks, i.e., oblivious random permutation and intersperse, the new modification takes logarithmically more total work while using the same depth, and the performance bounds are now deterministic. Hence, for the resulting OPRAM, obliviousness and correctness follow directly, and it takes logarithmically more overhead compared to Theorem 5.2. We thus have the following theorem.

**Theorem 6.1** (Formal statement of Theorem 1.3: OPRAM part). For any \( N \), there exists a perfect OPRAM scheme whose simulation overhead is \( O\left(\frac{\log^4 N}{\log \log N}\right) \) deterministically.

If we do not need parallelism, the simulation overhead can be further reduced by a logarithmic factor. Basically, to construct a perfect ORAM with deterministic performance bounds, we can instantiate the oblivious random permutation with the sequential variant of Asharov et al. [AKL+20a, Theorem 4.6]. Similarly, the oblivious random permutation in the Intersperse algorithm (Corollary 3.11) is also replaced with the sequential version Theorem 3.7. Since the sequential version is a logarithmic factor more efficient in work than the parallel version, we get the following corollary.

**Corollary 6.2** (Formal statement of Theorem 1.3: ORAM part). For any \( N \), there exists a perfect ORAM scheme whose simulation overhead is \( O\left(\frac{\log^3 N}{\log \log N}\right) \) deterministically.

### 6.2 High-Probability Performance

We now construct a perfect OPRAM that achieves simulation overhead \( O\left(\frac{\log^3 N}{\log \log N} + \frac{\log^2 N}{\log \log N} \cdot \text{poly} \log \log \frac{1}{\delta}\right) \) with \( 1 - O(T \cdot \delta) \) probability, where \( T \) is the parallel runtime of the original PRAM. As will become later, the \( T \) factor is due to taking a union bound over all time steps. Typically, one might want to set \( T \cdot \delta = \text{negl}(N) \) — in this case, assuming that \( T = \text{poly}(N) \), our perfect OPRAM scheme achieves \( O\left(\frac{\log^3 N}{\log \log N}\right) \) simulation overhead with all but \( \text{negl}(N) \) probability. Observe that the Las Vegas oblivious random permutation (Theorem 3.4) is more likely to exceed the performance bound \( n \log n \) when the problem size \( n \) is smaller — more specifically, the failure probability is upper bounded by \( n^{-\Omega(\sqrt{n})} \). Hence, we would like to apply the Las Vegas oblivious random permutation only when \( n \) is large enough so that \( n^{-\Omega(\sqrt{n})} \) is less than the given probability \( \delta \); in cases of small \( n \), we simply apply an oblivious random permutation with deterministic performance. The running time of intersperse (Corollary 3.11) is bounded with similar probability, and thus we take a similar approach. The modification is described below.

**OPRAM scheme with high probability performance.** Given probability \( \delta < 1 \), let \( \sigma := \max\{m, \log^2(1/\delta)\} \) for short. The construction is similar to that of Section 5.3.1 and the only difference is below:

1. Implement oblivious random permutation using the algorithm of Theorem 3.3 (i.e., replacing Theorem 3.4) only when the input size \( n < \sigma \), i.e., for levels \( i < \log(\sigma/m) \). Otherwise, Theorem 3.4 is still applied. Recall that oblivious random permutation is used in the Build procedure of one-time memory (Step 4, Section 4.2.2).
2. Implement parallel intersperse using the algorithm of Corollary 3.9 (i.e., replacing Corollary 3.11) only when the input size $n < \sigma$, i.e., for levels $i < \log(\sigma/m)$. Otherwise Corollary 3.11 is still applied. Recall that the parallel intersperse is used in the Build procedure of one-time memory (Step 2, Section 4.2.2), and in the MergeLevels procedure of position-based OPRAM (Step 1, Section 5.2.2). Note that for the oblivious random permutation inside the intersperse procedure, we use Theorem 3.3.

**Analysis.** The correctness and obliviousness follows directly. We start from the high-probability performance bounds for one-time memory and and position-based OPRAM.

**Fact 11** (High probability performance bounds for one-time memory). There exists a perfectly oblivious one-time memory scheme with the following performance. For $n \geq \sigma$, the Build algorithm completes in $O((n+\tilde{n}) \cdot (\chi + \log(n+\tilde{n})))$ total work and $O(\log^2(n+\tilde{n}))$ depth, except with probability $\delta^\Omega(1)$. For $n < \sigma$, it completes, with probability 1, in $O((n+\tilde{n}) \cdot (\chi + \log^2(n+\tilde{n})))$ total work and $O(\log^2(n+\tilde{n}))$ depth.

**Proof.** The $n \geq \sigma$ case is similar to the proof of Fact 1, and the only difference is that we plug in the high-probability bounds of intersperse and random permutation, which are both $n^{-\Omega(\sqrt{n})}$. Hence, the probability of running longer is at most $\exp(-\Omega(n^{1/2})) \leq \delta^\Omega(1)$. The $n < \sigma$ case follows by Theorem 3.3 and Corollary 3.9. \qed

**Fact 12** (High probability performance bounds for position-based OPRAM). For any OPRAM$_d$, let $\ell \leq d \log_2 \chi$, then the two phases of Shuffle($U, \ell, A$) (i.e., MergeLevels and UpdateLevel) have the following performance:

- For $m \cdot 2^\ell < \sigma$, with probability 1, it completes in $O(m \cdot 2^\ell \cdot (\chi + (\ell + \log m)^2) + \chi m \log m)$ total work and $O(\ell \cdot (\ell + \log m)^2)$ depth.

- For $m \cdot 2^\ell \geq \sigma$, except with probability $\ell \cdot \delta^\Omega(1)$, it completes with

$$
\begin{cases}
O(m \cdot 2^\ell \cdot (\chi + \ell + \log m) + \sigma \cdot (\chi + \log^2 \sigma) + \chi m \log m) & \text{total work, and} \\
O(\ell \cdot (\ell + \log m)^2 + \log^3 \sigma) & \text{depth}.
\end{cases}
$$

**Proof.** We first analyze MergeLevels. For small instances, $m \cdot 2^\ell < \sigma$, the total cost of MergeLevels is dominated by the non-Las-Vegas variants of Intersperse, ORP (only in the case OPRAM$_0$), and OTM$_d$.Build, while the depth is dominated by performing Intersperse for $\ell$ times. Hence, by Corollary 3.9, Theorem 3.3, and Fact 11, it takes $O(m \cdot 2^\ell \cdot (\chi + (\ell + \log m)^2) + \chi m \log m)$ total work and $O(\ell \cdot (\ell + \log m)^2)$ depth. Otherwise, $m \cdot 2^\ell \geq \sigma$, in MergeLevels, the first $\log(\sigma/m)$ levels of non-Las-Vegas Intersperse take $O(\sigma \cdot (\chi + \log^2 \sigma) + \chi m \log m)$ total work and $O(\log^3 \sigma)$ depth; The remaining $O(\ell)$ levels of Intersperse and one OTM$_d$.Build are both Las Vegas, and thus the analysis is similar to the analysis of Fact 5, where the only difference is taking union bound over the probability that any of such subroutines run longer, where the probability is $\delta^\Omega(1)$ for each Intersperse (Corollary 3.11) and OTM$_d$.Build (Fact 11). Adding up, except with probability $\ell \cdot \delta^\Omega(1)$, it takes the claimed total work and depth.

Then, in the second phase, UpdateLevel is constructed identical to the construction of Fact 5, and hence the total work and depth is the same and absorbed by MergeLevels in both cases. \qed

We next give the high probability statement for the maintain phase of the OPRAM.
Fact 13. Let $T$ be the total steps of the original PRAM (where each step contains $m$ memory concurrent requests). For any $\delta < 1$ and $N > \sigma$, except with $T \cdot \delta^{\Omega(1)}$ probability, the maintain phase of the OPRAM scheme incurs $T \cdot O\left(\frac{\log^3 N}{\log \log N} + \frac{\log^2 N \cdot \log(1/\delta)}{\log \log N} + \frac{\log N \cdot \log^3 \sigma}{\log \log N}\right)$ parallel steps consuming $m$ CPUs. Otherwise, for $N \leq \sigma$, the maintain phase incurs $T \cdot O\left(\frac{\log^4 N}{\log \log N}\right)$ parallel steps with probability 1.

Proof. If $N \leq \sigma$, then the scheme is identical to the deterministic one (Section 6.1) and the performance bound follows by Theorem 6.1.

If $N > \sigma$, we have levels $\ell$ from 0 to $\log(\sigma/m)$ falls to the small case and the remaining $O(D \log_2 \chi)$ levels falls to the large case in Fact 12. During $T$ batches of requests, for each $\ell$ such that $m \cdot 2^\ell < \sigma$, we perform $DT/m2^\ell$ instances of $\text{Shuffle}(U, \ell, A)$ for all $D$ recursion depths, and hence the maintenance on the $\log(\sigma/m)$ small levels takes

$$T \cdot O\left(D \cdot \left(\log \sigma \cdot (\log^2 m + \chi \log m) + \log^2 \sigma \cdot \log m + \log^3 \sigma\right)\right)$$

$$= T \cdot O\left(\frac{\log^2 N \cdot \log^2 \sigma + \log N \cdot \log^3 \sigma}{\log \log N}\right)$$

parallel steps (recall that $m \leq \sigma$). On the $O(D \log_2 \chi)$ large levels, the calculation is similar to the proof of Fact 10 and the only difference is the probability of taking running longer. Hence, the total number of parallel steps is

$$T \cdot O\left(\frac{\log^3 N}{\log \log N} + \frac{\log^2 N \cdot \log^2 \sigma + \log N \cdot \log^3 \sigma}{\log \log N}\right).$$

During any sequence of $T$ requests, the Las Vegas subroutines $\text{Intersperse}$ or $\text{ORP}$ are called by the maintain phase (indirectly from the position-based OPRAM or one-time memory) for at most $T$ times and the probability bound holds by taking union bound over Corollary 3.11 and Theorem 3.4, which yields $T \cdot \delta^{\Omega(1)}$.

Theorem 6.3 (Formal statement of Theorem 1.2). For any $\delta < 1$ and capacity $N \in \mathbb{N}$, there exists a perfect OPRAM scheme whose simulation overhead is upper bounded by

$$O\left(\frac{\log^3 N}{\log \log N} + \frac{\log^2 N \cdot \log^2 \log(1/\delta) + \log N \cdot \log^3 \log(1/\delta)}{\log \log N}\right)$$

with probability $1 - O(T \cdot \delta)$ where $T$ denotes the parallel runtime of the original PRAM being simulated obliviously by the OPRAM.

Proof. It follows from Lemma 5.1, Facts 9 and 13. We remark that our analysis of the claimed probability uses only union bound in Fact 13 and one can obtain a tighter probability when $T$ is large via concentration bounds as all events of running longer are independent.

References


A Why the KLO Rebalancing Trick Fails for Perfect ORAMs

To asymptotically improve the overhead, one promising idea is to somehow balance the fetch and maintain phases. This idea has been explored in computationally secure ORAMs first by Kushilevitz et al. [KLO12] and later improved in subsequent works [CGLS17]. Kushilevitz et al.’s idea is essentially a reparametrization trick that works for a (computationally secure) hierarchical ORAM.

We will explain Kushilevitz et al.’s trick pretending that the underlying hierarchical ORAM construction is the position-based ORAM scheme in Section 2.1.1 — we then explain why this particular trick is in fact, incompatible with our perfectly secure (position-based) ORAM (even though it worked in the context of earlier computationally secure schemes).

Basically, instead of having log_2 N many levels of doubling sizes, we now have $L := \log_\chi N + 1$ super-levels numbered 0, 1, ..., $L - 1$ where $\chi > 2$ is called the branching factor. Except for the largest super-level $L - 1$, each super-level $\ell < L - 1$ contains $\chi$ copies of normal levels where each level stores at most $\chi^{\ell}$ real blocks (and an appropriate number of dummies). The largest level can store $N$ real blocks and an appropriate number of dummy blocks.

A super-level is full iff all levels within it are full. Now for the maintain phase, suppose that super-levels 0, 1, ..., $\ell^*$ are all full (and suppose that $\ell^* + 1$ is not the largest level), we will merge all super-levels 0, ..., $\ell^*$ (as well as the most recently fetched block) into an empty level contained within the super-level $\ell^* + 1$. If all super-levels are full, then we merge all super-levels into the largest super-level. For the fetch phase, we need to read a block from every level residing in every super-level.
Assuming that this reparametrization had worked, then the fetch phase would incur $O(\chi \log \chi N)$ where $\chi$ stems from the number of levels within each super-level, and $\log \chi N$ comes from the number of levels. For the maintain phase, every $\chi^\ell$ accesses we would need to rebuild a level of capacity $\chi^\ell$. Thus (pretending for the time being that the number of dummies in a level equals the number of real blocks) the amortized cost incurred by the maintain phase would be $\log \chi N \log_2 N$ where the $\log_2 N$ factor comes from the oblivious sorting and the $\log \chi N$ factor can be thought of as coming from the number of super-levels. Now observe if we set $\chi = \log N$, both fetch- and maintain-phases would incur $O(\log^2 N / \log \log N)$; thus overall we achieve $\log \log N$ factor improvement over Section 2.1.1.

There is, however, a flaw in the above argument. In this reparametrized scheme, it is not hard to see that a level of capacity $\chi^\ell$ (contained inside a super-level) is accessed $\chi^{\ell+1}$ times before it is rebuilt. To ensure that each location is accessed at most once, the level needs to contain $\chi^{\ell+1}$ number of dummies. Since the number of dummies is $\chi$ times larger than the level’s capacity (i.e., the number of real blocks the level stores), the actual (amortized) cost of the maintain phase is $\chi$ factor greater than our analysis above.

In comparison, in earlier computationally secure ORAM constructions [KLO12,CGLS17], each level is a data structure called an “oblivious hash table” where dummies need not be over-provisioned relative to the number of reals (contrary to our perfectly secure construction in Section 2.1.1); and yet such an “oblivious hash table” can support unbounded number of accesses, as long as each real block is requested at most once [KLO12,CGLS17]. This explains why this rebalancing trick worked in the context of earlier computationally secure ORAMs.