Abstract

We build a system that provides succinct non-interactive zero-knowledge proofs (zk-SNARKs) for program executions on a von Neumann RISC architecture. The system has two components: a cryptographic proof system for verifying satisfiability of arithmetic circuits, and a circuit generator to translate program executions to such circuits. Our design of both components improves in functionality and efficiency over prior work, as follows.

Our circuit generator is the first to be universal: it does not need to know the program, but only a bound on its running time. Moreover, the size of the output circuit depends additively (rather than multiplicatively) on program size, allowing verification of larger programs.

The cryptographic proof system improves proving and verification times, by leveraging new algorithms and a pairing library tailored to the protocol.

We evaluated our system for programs with up to 10,000 instructions, running for up to 32,000 machine steps, each of which can arbitrarily access random-access memory; and also demonstrated it executing programs that use just-in-time compilation. Our proofs are 230 bytes long at 80 bits of security, or 288 bytes long at 128 bits of security. Typical verification time is 5 milliseconds, regardless of the original program’s running time.

Keywords: zero-knowledge, succinct arguments, computationally-sound proofs
Contents

1 Introduction 3
  1.1 Goal ................................................. 3
  1.2 Prior work ......................................... 3
  1.3 Limitations of prior work on zk-SNARKs .................. 4
  1.4 Results .............................................. 5
  1.5 Roadmap ............................................ 7

2 Preliminaries 7
  2.1 Notation ............................................. 7
  2.2 Arithmetic circuits .................................. 8
  2.3 Quadratic arithmetic programs ......................... 8
  2.4 Pairings ............................................. 9
  2.5 zk-SNARKs for arithmetic circuits ..................... 9
  2.6 A von Neumann RISC architecture .................... 10

3 Our circuit generator 10
  3.1 Past techniques ..................................... 11
  3.2 Our construction .................................... 12

4 Our zk-SNARK for circuits 14
  4.1 The PGHR protocol and the two elliptic curves ......... 14
  4.2 An optimized verifier .................................. 15
  4.3 An optimized prover ................................... 16
  4.4 An optimized key generator ............................. 16

5 Evaluation 17
  5.1 Performance of our circuit generator .................. 17
  5.2 Performance of our zk-SNARK for circuit satisfiability . 19
  5.3 Performance of the combined system .................... 19
  5.4 Comparison with prior work ............................ 21

6 Conclusion 22

A Other prior work 24

B The PGHR zk-SNARK protocol 25

C Additional experimental data 26

D zk-SNARKs for vTinyRAM 30
  D.1 Informal definition .................................. 30
  D.2 Construction ........................................ 31

E Case study: memcpy with just-in-time compilation 32

References 33
1 Introduction

1.1 Goal

Consider the setting where a client owns a public input $x$, a server owns a private input $w$, and the client wishes to learn $z := F(x, w)$ for a program $F$ known to both parties. For instance, $x$ may be a query, $w$ a confidential database, and $F$ the program that executes the query on the database.

Security. The client is concerned about integrity of computation: how can he ascertain that the server reports the correct output $z$? In contrast, the server is concerned about confidentiality of his own input: how can he prevent the client from learning information about $w$?

Cryptography offers a powerful tool to address these security concerns: zero-knowledge proofs [GMR89]. The server, acting as the prover, attempts to convince the client, acting as the verifier, that the following NP statement is true: “there exists $w$ such that $z = F(x, w)$”. Indeed:

- The soundness property of the proof system guarantees that, if the NP statement is false, the prover cannot convince the verifier (with high probability). Thus, soundness addresses the client’s integrity concern.
- The zero-knowledge property of the proof system guarantees that, if the NP statement is true, the prover can convince the verifier without leaking any information about $w$ (beyond was is leaked by the output $z$).

Thus, zero knowledge addresses the server’s confidentiality concern.

Moreover, the client sometimes not only seeks soundness but also proof of knowledge [GMR89, BG93], which guarantees that, whenever he is convinced, not only can he deduce that a witness $w$ exists, but also that the server knows one such witness. This stronger property is often necessary to security if $F$ encodes cryptographic computations, and is satisfied by most zero-knowledge proof systems.

Efficiency. Besides the aforementioned security desiderata, many settings also call for efficiency desiderata. The client may be either unable or unwilling to engage in lengthy interactions with the server, or to perform large computations beyond the “bare minimum” of sending the input $x$ and receiving the output $z$. For instance, the client may be a computationally-weak device with intermittent connectivity (e.g., a smartphone).

Thus, it is desirable for the proof to be non-interactive [BFM88, NY90, BDSMP91]: the server just send the claimed output $\tilde{z}$, along with a non-interactive proof string $\pi$ that attests that $\tilde{z}$ is the correct output. Moreover, it is also desirable for the proof to be succinct: $\pi$ has size $O_\lambda(1)$ and can be verified in time $O_\lambda(|F| + |x| + |z|)$, where $O_\lambda(\cdot)$ is some polynomial in a security parameter $\lambda$; in other words, $\pi$ is very short and easy to verify (i.e., verification time does not depend on $|w|$, nor $F$’s running time).

zk-SNARKs. A proof system achieving the above security and efficiency desiderata is called a (publicly-verifiable) zero-knowledge Succinct Non-interactive ARgument of Knowledge (zk-SNARK). zk-SNARK constructions can be applied to a wide range of security applications, provided these constructions deliver good enough efficiency, and support rich enough functionality (i.e., the class of programs $F$ that is supported).

Remark 1.1. In the zero-knowledge setting above, the client does not have the server’s input, and so cannot conduct the computation on his own. Hence, it is not meaningful to compare “efficiency of outsourced computation at the server” and “efficiency of native execution at the client”, because the latter was never an option. Non-interactive zero-knowledge proofs (and zk-SNARKs) are useful regardless of cross-over points.

Our goal in this paper is to construct

a zk-SNARK implementation supporting executions on a universal von Neumann RISC machine.

1.2 Prior work

zk-SNARKs. Many works have obtained zk-SNARK constructions [Gro10a, Lip12, GGPR13, BCIOP13, PGHR13, BCGTV13a, Lip13, BFRS+13]. Three of these [PGHR13, BCGTV13a, BFRS+13] provide implementations, and thus we briefly recall them. Parno et al. [PGHR13] present two main contributions.
• A zk-SNARK, with essentially-optimal asymptotics, for arithmetic circuit satisfiability, based on quadratic arithmetic programs (QAPs) [GGPR13]. They accompany their construction with an implementation.

• A compiler that maps C programs with fixed memory accesses and bounded control flow (e.g., array accesses and loop iteration bounds are compile-time constants) into corresponding arithmetic circuits. Ben-Sasson et al. [BCGTV13a] present three main contributions.

• Also a QAP-based zk-SNARK with essentially-optimal asymptotics for arithmetic circuit satisfiability, and a corresponding implementation. Their construction follows the linear-interactive proofs of [BCIOP13].

• A simple RISC architecture, TinyRAM, along with a circuit generator for generating arithmetic circuits that verify correct execution of TinyRAM programs.

• A compiler that, given a C program, produces a corresponding TinyRAM program. Thus, both [PGHR13, BCGTV13a] have two main components: a zk-SNARK for a low-level language, and method to translate a high-level language to the low-level language. Finally, Braun et al. [BFRS+13] re-implemented the protocol of [PGHR13] and combined it with a circuit generator that incorporates memory-checking techniques [BEGKN91] to support random-access memory [BCGT13a].

**Outsourcing computation to powerful servers.** Numerous works [SBW11, SMBW12, SVPB+12, SBVB+13, CMT12, TRMP12, VSBW13, Tha13, BFRS+13] seek to verifiably outsource computation to untrusted powerful servers, e.g., in order to make use of cheaper cycles or storage. (See Appendix A for a summary.) We stress that verifiable outsourcing of computations is not our goal. Rather, as mentioned, we study functionality and efficiency aspects of non-interactive zero-knowledge proofs, which are useful even when applied to relatively-small computations, and even with high overheads.

Compared to most protocols to outsource computations, known zk-SNARKs use “heavyweight” techniques, such as probabilistically-checkable proofs [BFLS91] and expensive pairing-based cryptography. The optimal choice of protocol, and whether it actually pays off compared to local native execution, are complex, computation-dependent questions [VSBW13], and we leave to future work the question of whether zk-SNARKs are useful for the goal of outsourcing computations.

### 1.3 Limitations of prior work on zk-SNARKs

Recent work has made tremendous progress in taking zk-SNARKs from asymptotic theory into concrete implementations. Yet, known implementations suffer from several limitations.

**Per-program key generation.** As in any non-interactive zero-knowledge proof, a zk-SNARK requires a one-time trusted setup of public parameters: a key generator samples a proving key (used to generate proofs) and a verification key (used to check proofs). However, current zk-SNARK implementations [PGHR13, BCGTV13a] require the setup phase to depend on the program $F$, which is hard-coded in the keys. Key generation is costly (quasilinear in $F$’s runtime) and is thus difficult to amortize if conducted anew for each program. More importantly, per-program key generation requires, for each new choice of program, a trusted party’s help.

**Limited support for high-level languages.** Known circuit generators have limited functionality or efficiency: (i) [PGHR13]’s circuit generator only supports programs without data dependencies, since memory accesses and loop iteration bounds cannot depend on a program’s input; (ii) [BFRS+13]’s circuit generator allows data-dependent memory accesses, but each such access requires expensive hashing to verify Merkle-tree authentication paths; (iii) [BCGT13a]’s circuit generator supports arbitrary programs but its circuit size scales inefficiently with program size (namely, it has size $\Omega(\ell T)$ for $\ell$-instruction $T$-step TinyRAM programs). Moreover, while there are techniques that mitigate some of the above limitations [ZE13], these only apply in special cases, and not do address general data dependencies, a common occurrence in many programs.

Ultimately, large general programs rely on external libraries (providing, e.g., mathematical subroutines or data structures), which contribute to program size. Thus, it is crucial to seek circuits that simultaneously...
support arbitrary programs and that efficiently scale with program size. 

**Generic sub-algorithms.** The aforementioned zk-SNARKs use several sub-algorithms, and in particular elliptic curves and pairings. Protocol-specific optimizations are a key ingredient in fast implementations of pairing-based protocols [Sc05], yet prior implementations only utilize off-the-shelf cryptographic libraries, and miss key optimization opportunities.

1.4 Results

We present two main contributions: a new circuit generator and a new zk-SNARK for circuits. These can be used independently, or combined to obtain an overall system.

1.4.1 A new circuit generator

We design and build a new circuit generator that incorporates the following two main improvements.

(1) Our circuit generator is universal: when given input bounds \(\ell, n, T\), it produces a circuit that can verify the execution of any program with \(\leq \ell\) instructions, on any input of size \(\leq n\), for \(\leq T\) steps. Instead, all prior circuit generators [SVPB+12, SBVB+13, PGHR13, BCGTV13a, BFRS+13] hardcoded the program in the circuit. Combined with a zk-SNARK for circuits (or any NP proof system for circuits), we achieve a notable conceptual advance: once-and-for-all key generation that allows verifying all programs up to a given size. This removes major issues in all prior systems: expensive per-program key generation, and the thorny issue of conducting it anew in a trusted way for every program.

Our circuit generator supports a universal machine that, like modern computers, follows the von Neumann paradigm (program and data lie in the same read/write address space). Concretely, it supports a von Neumann RISC architecture called \(\text{vnTinyRAM}\), a modification of \(\text{TinyRAM}\) [BCGTV13b]. Thus, we also support programs leveraging techniques such as just-in-time compilation or self-modifying code [GESA+09, RP06].

To compile C programs to the \(\text{vnTinyRAM}\) machine language, we ported the GCC compiler to this architecture, building on the work of [BCGTV13a].

See Figure 1 for a functionality comparison with prior circuit generators (for details, see [BFRS+13, §2]).

<table>
<thead>
<tr>
<th>Supported functionality</th>
<th>[SVPB+12]</th>
<th>[SBVB+13]</th>
<th>[PGHR13]</th>
<th>[BCGTV13a]</th>
<th>[BFRS+13]</th>
<th>this work</th>
</tr>
</thead>
<tbody>
<tr>
<td>side-effect free computations</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>data-dependent memory accesses</td>
<td>×</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>data-dependent control flow</td>
<td>×</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>self-modifying code</td>
<td>×</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>universality</td>
<td>×</td>
<td>×</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

Figure 1: Comparison of the functionality supported by our and previous circuit generators.

(2) Our circuit generator efficiently handles larger arbitrary programs: the size of the generated circuit \(C_{\ell,n,T}\) in terms of the bounds \(\ell, n, T\), is

\[
O\left((\ell + n + T) \cdot \log(\ell + n + T)\right) \text{ gates.}
\]

Thus, the dependence on program size is additive, instead of multiplicative as in [BCGTV13a], where the generated (non-universal) circuit has size \(\Theta((n + T) \cdot (\log(n + T) + \ell))\). As Figure 2 shows, our efficiency improvement compared to [BCGTV13a] is not merely asymptotic but yields sizable concrete savings: as program size \(\ell\) increases, our amortized per-cycle gate count is essentially unchanged, while that of [BCGTV13a] grows without bound, becoming orders of magnitudes more expensive.

An efficiency comparison with other non-universal circuit generators [SVPB+12, SBVB+13, PGHR13, BFRS+13] is not well-defined. First, they support more restricted classes of programs, so a programmer must “write around” the limited functionality. Second, their efficiency is not easily specified, since the output circuit is ad hoc for the given program, and the only way to know its size is to actually run the circuit generator. We
Table 2: Per-cycle gate count improvements over [BCGTV13a].

<table>
<thead>
<tr>
<th>(n = 10^2)</th>
<th>(2^{20})</th>
<th>[BCGTV13a]</th>
<th>this work</th>
<th>improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>(\ell = 10^3)</td>
<td>1,872</td>
<td>1.368</td>
<td>1.4×</td>
<td></td>
</tr>
<tr>
<td>(\ell = 10^4)</td>
<td>10,872</td>
<td>1.371</td>
<td>7.9×</td>
<td></td>
</tr>
<tr>
<td>(\ell = 10^5)</td>
<td>100,872</td>
<td>1.400</td>
<td>72.1×</td>
<td></td>
</tr>
<tr>
<td>(\ell = 10^6)</td>
<td>1,000,872</td>
<td>1.694</td>
<td>590.8×</td>
<td></td>
</tr>
</tbody>
</table>

Figure 2: Per-cycle gate count improvements over [BCGTV13a].

expect, and find, that such circuit generators perform better than ours for programs that are already “close to a circuit”, and worse for programs rich in data-dependent memory accesses and control flow.

1.4.2 A new zk-SNARK for circuits

Our third contribution is a high-performance implementation of a zk-SNARK for arithmetic circuits.

(3) We improve upon and implement the protocol of Parno et al. [PGHR13]. Unlike previous zk-SNARK implementations [PGHR13, BCGTV13a, BFRS+13], we do not use off-the-shelf cryptographic libraries. Rather, we create a tailored implementation of the requisite components: the underlying finite-field arithmetic, elliptic-curve group arithmetic, pairing-based checks, and so on.

To facilitate comparison with prior work, we instantiate our techniques for two specific algebraic setups: we provide an instantiation based on Edwards curves [Edw07] at 80 bits of security (as in [BCGTV13a]), and an instantiation based on Barreto–Naehrig curves [BN06] at 128 bits of security (as in [PGHR13, BFRS+13]).

On our reference platform (a typical desktop), proof verification is fast: at 80-bit security, for an \(n\)-byte input to the circuit, verification takes \(4.7 + 0.0004 \cdot n\) milliseconds, regardless of circuit size; at 128-bit security, it takes \(4.8 + 0.0005 \cdot n\). The constant term dominates for small inputs, and corresponds to the verifier’s pairing-based checks; in both cases, it is less than half the time for separately evaluating the 12 requisite pairings of the checks. We achieve this saving by merging parts of the pairings’ computation in a protocol-dependent way — another reason for a custom implementation of the underlying math.

Key generation and proof generation entail a per-gate cost. For example, for a circuit with 16 million gates: at 80 bits of security, key generation takes 81 \(\mu\)s per gate and proving takes 109 \(\mu\)s per gate; at 128 bits of security, these per-gate costs mildly increase to 100 \(\mu\)s and 144 \(\mu\)s.

As in previous zk-SNARK implementations, proofs have constant size (independent of the circuit or input size); for us, they are 230 bytes at 80 bits of security, and 288 bytes at 128 bits of security.

Compared to previous implementations of zk-SNARKs for circuits [PGHR13, BCGTV13a, BFRS+13], our implementation improves both proving and verification times, e.g., see Figure 3.

![Figure 3: Comparison with prior zk-SNARKs for a 1-million-gate arithmetic circuit and a 1000-bit input, running on our benchmarking machine, using software provided by the respective authors. Since [BFRS+13] is a re-implementation of [PGHR13], we only include the latter’s performance. (\(N = 5\) and \(\text{std} < 2\%\))]
PGHR13, BCGTV13a, BFRS+13], thus granting these systems universality: once-and-for-all key generation. Similarly, our zk-SNARK for circuits can replace the underlying zk-SNARKs in [PGHR13, BCGTV13a, BFRS+13], or be used directly in applications where a suitable circuit is already specified.

Combining these two components, we obtain a full system: a zk-SNARK for proving/verifying correctness of vnTinyRAM computations; see Figure 4 and Figure 5 for diagrams of this system. We evaluated this overall system for programs with up to 10,000 instructions, running for up to 32,000 steps. Verification time is, again, only few milliseconds, independent of the running time of the vnTinyRAM program, even when program size and input size are kilobytes. Proofs, as mentioned, have a small constant size. Key generation and proof generation entail a per-cycle cost, with a dependence on program size that “tapers off” as computation length increases. For instance, at 128-bit security and vnTinyRAM with a word size of 32 bits, key generation takes 210 ms per cycle and proving takes 100 ms per cycle, for 8K-instruction programs.

**Figure 4**: Offline phase (once). The key generator outputs a proving key and verification key, for proving and verifying correctness of any program execution meeting the given bounds.

**Figure 5**: Online phase (any number of times). The prover sends a short and easy-to-verify proof to a verifier. This can be repeated any number of times, each time for a different program and input.

**JIT case study: efficient memcpy.** Besides evaluating individual components, we give an example demonstrating the rich functionality supported by the integrated system. We wrote a vnTinyRAM implementation of memcpy that leverages just-in-time compilation (in particular, dynamic loop unrolling) to require fewer cycles. While ours is a simple case study, just-in-time compilation is a widely-used powerful technique with many applications, e.g., increasing the performance of interpreted programming languages such as JavaScript in web browsers [GESA+09] or Python [RP06]. As the efficiency of zk-SNARK implementations improves, more and more of these applications will become feasible.

1.5 Roadmap

In Section 2 we provide preliminaries. In Section 3 we describe our circuit generator. In Section 4 we describe our zk-SNARK for circuits. In Section 5 we evaluate our circuit generator and zk-SNARK, as well as the system resulting by combining the two. In Section 6 we conclude the paper.

2 Preliminaries

2.1 Notation

We denote by $\mathbb{F}$ a finite field and $\mathbb{F}_n$ is the field of size $n$; when $n$ is prime, we identify elements of $\mathbb{F}_n$ with integers modulo $n$. Field elements are denoted with Greek letters (e.g., $\alpha, \beta, \gamma$). We denote by $\mathbb{F}[z]$ the ring of univariate polynomials over $\mathbb{F}$, and by $\mathbb{F}^{\leq d}[z]$ the subring of polynomials of degree $\leq d$. Vectors are denoted by arrow-equipped letters (such as $\vec{a}$); their entries carry an index but not the arrow (e.g., $a_1$ or $a_2$). Concatenation of vectors (and scalars) is denoted by the operator $\circ$. 

7
2.2 Arithmetic circuits

The circuits that we consider are not boolean but arithmetic. Given a finite field $\mathbb{F}$, an $\mathbb{F}$-arithmetic circuit takes inputs that are elements in $\mathbb{F}$, and its gates output elements in $\mathbb{F}$. The circuits we consider only have bilinear gates.\footnote{A gate with inputs $x_1, \ldots, x_m \in \mathbb{F}$ is bilinear if the output is $(\langle \vec{a}, (1, x_1, \ldots, x_m) \rangle \cdot \langle \vec{b}, (1, x_1, \ldots, x_m) \rangle)$ for some $\vec{a}, \vec{b} \in \mathbb{F}^{m+1}$. In particular, these include addition, multiplication, and constant gates.} Arithmetic circuit satisfiability is analogous to the boolean case:

**Definition 2.1.** Let $n, h, l$ respectively denote the input, witness, and output size. The circuit satisfaction problem of a circuit $C : \mathbb{F}^n \times \mathbb{F}^h \rightarrow \mathbb{F}^l$ with bilinear gates is defined by the relation $R_C = \{ (\vec{x}, \vec{a}) \in \mathbb{F}^n \times \mathbb{F}^h : C(\vec{x}, \vec{a}) = 0 \}$ and its language is $L_C = \{ \vec{x} \in \mathbb{F}^n : \exists \vec{a} \in \mathbb{F}^h, C(\vec{x}, \vec{a}) = 0 \}$.

Our circuit generator reduces the correctness of program executions to arithmetic circuit satisfiability (see Section 3), and our zk-SNARK implementation produces/verifies proofs for this language (see Section 4).

All the arithmetic circuits we consider are over prime fields $\mathbb{F}_p$. In this case, when passing boolean strings as inputs to arithmetic circuits, we pack the string’s bits into as few field elements as possible: given $s \in \{0,1\}^m$, we use $[s]_p^m$ to denote the vector $\vec{x} \in \mathbb{F}_p^m$, where $|m|_p := \lfloor m / \log p \rfloor$, such that the binary representation of $x_i \in \mathbb{F}_p$ is the $i$-th block of $\lfloor \log p \rfloor$ bits in $s$ (padded with 0’s if needed). We extend the notation $[s]_p^m$ to binary strings $s \in \{0,1\}^n$ with $n < m$ bits via padding: $[s]_p^m := [s0^{m-n}]_p^m$.

2.3 Quadratic arithmetic programs

Our zk-SNARK leverages quadratic arithmetic programs (QAPs), introduced by Gennaro et al. [GGPR13].

**Definition 2.2.** A quadratic arithmetic program of size $m$ and degree $d$ over $\mathbb{F}$ is a tuple $(\vec{A}, \vec{B}, \vec{C}, Z)$, where $\vec{A}, \vec{B}, \vec{C}$ are three vectors, each of $m + 1$ polynomials in $\mathbb{F}^{d-1}[z]$, and $Z \in \mathbb{F}[z]$ has degree exactly $d$.

Like a circuit, a QAP induces a satisfaction problem:

**Definition 2.3.** The satisfaction problem of a size-$m$ QAP $(\vec{A}, \vec{B}, \vec{C}, Z)$ is the relation $R_{(\vec{A}, \vec{B}, \vec{C}, Z)}$ of pairs $(\vec{x}, \vec{s})$ such that (i) $\vec{x} \in \mathbb{F}^n$, $\vec{s} \in \mathbb{F}^m$, and $n \leq m$; (ii) $x_i = s_i$ for $i \in [n]$ (i.e., $\vec{s}$ extends $\vec{x}$); and (iii) the polynomial $Z(z)$ divides the following one:

$$(A_0(z) + \sum_{i=1}^m s_i A_i(z)) \cdot (B_0(z) + \sum_{i=1}^m s_i B_i(z)) - (C_0(z) + \sum_{i=1}^m s_i C_i(z)).$$

We denote by $L_{(\vec{A}, \vec{B}, \vec{C}, Z)}$ the language of $R_{(\vec{A}, \vec{B}, \vec{C}, Z)}$.

Gennaro et al. [GGPR13] showed that circuit satisfiability can be efficiently reduced to QAP satisfiability (which can then be proved and verified using zk-SNARKs):

**Lemma 2.4.** There exist two polynomial-time algorithms QAPinst, QAPwit that work as follows. For any circuit $C : \mathbb{F}^n \times \mathbb{F}^h \rightarrow \mathbb{F}^l$ with $a$ wires and $b$ gates, $(\vec{A}, \vec{B}, \vec{C}, Z) :=$ QAPinst$(C)$ is a QAP of size $m$ and degree $d$ over $\mathbb{F}$ that satisfies the following three properties.

- **Efficiency.** It holds that $m = a$ and $d = n + b + l + 1$.
- **Completeness.** For any $(\vec{x}, \vec{a}) \in R_C$, it holds that $(\vec{x}, \vec{s}) \in R_{(\vec{A}, \vec{B}, \vec{C}, Z)}$, where $\vec{s} :=$ QAPwit$(C, \vec{x}, \vec{a})$.
- **Proof of Knowledge.** For any $(\vec{x}, \vec{s}) \in R_{(\vec{A}, \vec{B}, \vec{C}, Z)}$, it holds that $(\vec{x}, \vec{a}) \in R_C$, where $\vec{a}$ is a prefix of $\vec{s}$.
- **Non-Degeneracy.** The polynomials $A_0, \ldots, A_n$ are linearly independent. Moreover, the intersection of the span of $A_0, \ldots, A_n$ and the span of $A_{n+1}, \ldots, A_m$ is trivial (only contains the zero polynomial).
Proof sketch. The third condition in Definition 2.3 implies that \( \langle 1 \circ \vec{s}, \vec{A}(\omega) \rangle \cdot \langle 1 \circ \vec{s}, \vec{B}(\omega) \rangle = \langle 1 \circ \vec{s}, \vec{C}(\omega) \rangle \) for all roots \( \omega \) of \( Z \). In other words, membership in \( R_{(\vec{A}, \vec{B}, \vec{C}, Z)} \) is characterized by deg \( Z = d \) rank-1 quadratic constraints in the variable \( \vec{s} \), and we can choose these constraints by choosing coefficients for the polynomials \( \vec{A}, \vec{B}, \vec{C} \). We use \( b + l \) constraints to encode the satisfiability of the arithmetic circuit \( C \): one constraint per gate (enforcing its correct evaluation) and one constraint per circuit output (enforcing it to be zero). We then use an additional \( 1 + n \) constraints to meet the non-degeneracy condition: \( 1 \cdot 0 = 0 \) and \( s_i \cdot 0 = 0 \) for \( i = 1, \ldots, n^3 \). \( \square \)

Remark 2.5. The authors thank Ariel Gabizon [Gab19] and Bryan Parno for identifying a problem in an earlier version of Lemma 2.4. Previously, the lemma’s “non-degeneracy” condition merely required that \( A_0, \ldots, A_n \) be non-zero and distinct, which did not suffice for the security of the protocol in Appendix B. The current non-degeneracy condition yields a QAP degree of \( d = n + b + l + 1 \) rather than \( d = b + l + 1 \).

The degree increase is negligible for all applications reported in this paper and [BCGG + 14, BCTV14, CTV15, BCGTV15] because \( n \ll b \) (typically, \( n/b < 10^{-4} \)), leaving performance results unaffected. A similar negligible increase holds for any circuit in which the number of inputs \( n \) is small compared to the number of gates \( b \), as is the case, e.g., for many circuits that verify cryptographic computations.

2.4 Pairings

The zk-SNARK constructions that we consider are based on cryptographic pairings, which we now introduce.

Let \( G_1 \) and \( G_2 \) be two cyclic groups of order \( r \). We denote elements of \( G_1, G_2 \) via calligraphic letters such as \( P, Q \). We write \( G_1 \) and \( G_2 \) in additive notation. Let \( P_1 \) be a generator of \( G_1 \), i.e., \( \{\alpha P_1\}_{\alpha \in F_r} \) (\( \alpha \) is also viewed as an integer, hence \( \alpha P_1 \) is well-defined); let \( P_2 \) be a generator for \( G_2 \). A pairing is an efficient map \( e: G_1 \times G_2 \to G_T \), where \( G_T \) is also a cyclic group of order \( r \) (which we write in multiplicative notation), satisfying the following properties: (i) bilinearity: for every nonzero elements \( \alpha, \beta \in F_r \), it holds that \( e(\alpha P_1, \beta P_2) = e(P_1, P_2)^{\alpha \beta} \); (ii) non-degeneracy: \( e(P_1, P_2) \) is not the identity in \( G_T \).

For high-level discussions of zk-SNARK constructions, the choice of instantiation of \( G_1, G_2, G_T \), as well as the choice of pairing \( e \), does not matter. However, later, when discussing optimizations in our implementation (see Section 4), these choices matter a great deal.

2.5 zk-SNARKs for arithmetic circuits

A (preprocessing) zk-SNARK for \( F \)-arithmetic circuit satisfiability (see, e.g., [BCIOP13]) is a triple of polynomial-time algorithms \( (G, P, V) \), called key generator, prover, and verifier. The key generator \( G \), given a security parameter \( \lambda \) and an \( F \)-arithmetic circuit \( C : F^n \times F^h \to F^l \), samples a proving key \( pk \) and a verification key \( vk \); these are the proof system’s public parameters, which need to be generated only once per circuit. After that, anyone can use \( pk \) to generate non-interactive proofs for the language \( L_C \), and anyone can use the \( vk \) to check these proofs. Namely, given \( pk \) and any \( (\vec{x}, \vec{a}) \in R_C \), the honest prover \( P(pk, \vec{x}, \vec{a}) \) produces a proof \( \pi \) attesting that \( \vec{x} \in L_C \); the verifier \( V(vk, \vec{x}, \pi) \) checks that \( \pi \) is a valid proof for \( \vec{x} \in L_C \). A proof \( \pi \) is both a proof of knowledge, and a (statistical) zero-knowledge proof. The succinctness property requires that \( \pi \) has length \( O_\lambda(1) \) and \( V \) runs in time \( O_\lambda(|\vec{x}|) \), where \( O_\lambda \) hides a (fixed) polynomial in \( \lambda \).

Constructions. Several zk-SNARK constructions are known [Gro10a, Lip12, GGPR13, BCIOP13, PGHR13 BCGTV13a, Lip13]. The most efficient ones are based on quadratic span programs (QSPs) [GGPR13, Lip13] or quadratic arithmetic programs (QAPs) [GGPR13, BCIOP13, PGHR13 BCGTV13a]. We focused on QAP-based constructions, because QAPs allow for tighter reductions from arithmetic circuits (see Lemma 2.4). Concretely, we build on the QAP-based zk-SNARK protocol of Parno et al. [PGHR13] (see Section 4).
Remark 2.6 (full succinctness). The key generator $G$ takes $C$ as input, and so its complexity is linear in $|C|$. One could require $G$ to not take $C$ as input, and have its output keys work for all (polynomial-size) circuits $C$; then, $G$’s running time would be independent of $C$. A zk-SNARK satisfying this stronger property is fully succinct. Theoretical constructions of such zk-SNARKs are known, based on various cryptographic assumptions \cite{Mic00, Val08, BCCT13}. Despite achieving essentially-optimal asymptotics \cite{BFLS91, BHSV05, BCGT13b, BCTV14} no implementations of them have been reported to date.\footnote{In concurrent work, Ben-Sasson et al. \cite{BCTV14} build a fully-succinct zk-SNARK, by following the approach of \cite{BCCT13}. See \cite{BCTV14} for a discussion about the tradeoffs between our construction and theirs.}

2.6 A von Neumann RISC architecture

Ben-Sasson et al. \cite{BCGTV13a} introduced TinyRAM, a Harvard RISC architecture with word-addressable memory. We modify TinyRAM to obtain vnTinyRAM, which differs from it in two main ways. First, vnTinyRAM follows the von Neumann paradigm, whereby program and data are stored in the same read-write address space; programs may use runtime code generation. Second, vnTinyRAM has byte-addressable memory, along with instructions to load/store bytes or words.\footnote{Byte-addressing is common in programs performing array or string operations (and is a deeply-ingrained assumption in the GCC and LLVM compilers), while word-addressing in programs performing arithmetic. Simultaneous support for both greatly simplifies compiling higher-level languages to vnTinyRAM.}

Besides the above main differences, vnTinyRAM is very similar to TinyRAM. Namely, it is parametrized by the word size, denoted $W$, and the number of registers, denoted $K$. The CPU state of the machine consists of (i) a $W$-bit program counter; (ii) $K$ general-purpose $W$-bit registers; (iii) a 1-bit condition flag. The full state of the machine also includes memory, which is a linear array of $2^W$ bytes, and two tapes, each with a string of $W$-bit words, and read-only in one direction. One tape is for a primary input $x$ and the other for an auxiliary input $w$ (treated as nondeterministic, untrusted advice).

In memory, an instruction is represented as a double word (one word for an immediate, and another for opcode, etc.). Thus, a program $P$ is a list of address/double-word pairs specifying the initial contents of memory; all other memory locations assume the initial value of 0.

At every time step, the machine executes the instruction encoded by pc-th double word in memory, where pc is program counter pc (with its lowest $2W/8$ set to 0); every instruction increments pc by $2W/8$ (which is number of bytes in a double word), unless it explicitly modifies pc. The machine’s only input is via the input tapes and initial memory, and only output is via an answer instruction (which halts execution) having a single argument $A$, representing the return value, where $A = 0$ means “accept”.

Language of accepting computations. Formally, when saying “prover/verify correct execution” we mean “membership in the language of accepting computations”. This language is defined as follows.

Definition 2.7. Fix bounds $\ell, n, T$. The language $L_{\ell,n,T}$ consists of pairs $(P, z)$ such that: (i) $P$ is a program with $\leq \ell$ instructions, (ii) $z$ is a primary input with $\leq n$ words, (iii) there exists an auxiliary input $w$ s.t. $P(z, w)$ accepts in $\leq T$ steps. We denote by $R_{\ell,n,T}$ the relation corresponding to $L_{\ell,n,T}$.

For more details about vnTinyRAM, see \cite{BCGT13b}.
**Definition 3.1.** A (universal) circuit generator of efficiency $f(\cdot)$ over a prime field $\mathbb{F}_p$ is a polynomial-time algorithm $\text{circ}$, together with an efficient witness map $\text{wit}$, working as follows. For any program size bound $\ell$, time bound $T$, and primary-input size bound $n$, $C := \text{circ}(\ell, n, T)$ is an $\mathbb{F}_p$-arithmetic circuit $C : \mathbb{F}_p^m \times \mathbb{F}_p^h \rightarrow \mathbb{F}_p$ for $m := |2W|_p + |nW|_p$ and some $h, \ell$, where $W$ is the word size (cf. Section 2.6).

- **Efficiency.** The circuit $C$ has $f(\ell, n, T)$ gates.
- **Completeness.** Given any program $P$, primary input $\vec{x}$, and witness $\vec{w}$ such that $(P, \vec{x}, \vec{w}) \in \mathcal{R}_{\ell,n,T}$, it holds that $(\vec{x}, \vec{a}) \in \mathcal{R}_C$, where $\vec{x} := [P]_p^{2W} \circ [\vec{x}]_p^n W$ and $\vec{a} := \text{wit}(\ell, n, T, P, \vec{x}, \vec{w})$.
- **Proof of Knowledge.** There is a polynomial-time algorithm such that, given any $(\vec{x}, \vec{a}) \in \mathcal{R}_C$, outputs a witness $\vec{w}$ for $(P, \vec{x}) \in \mathcal{L}_{\ell,n,T}$.

The circuit $C$ output by $\text{circ}$ is universal because it does not depend on the program $P$ or primary input $\vec{x}$, but only on their respective size bounds $\ell$ and $n$ (as well as the time bound $T$). When combined with any proof system for circuit satisfiability (e.g., our zk-SNARK), this fact enables the generation of the proof systems’ parameters to be universal as well. Namely, it is possible to generate keys for all bound choices (e.g., in powers of 2) up to some constant, once and for all; afterwards, one can pick the keys corresponding to bounds fitting a given computation. This avoids expensive per-program key generation and, more importantly, the need for a trusted party to conduct key generation anew for every program.

We construct a universal circuit generator with the following efficiency:

**Theorem 3.2.** There is a circuit generator of efficiency $f(\ell, n, T) = O((\ell + n + T) \cdot \log(\ell + n + T))$ over any prime field $\mathbb{F}_p$ of size $p > 2^{2W}$, where $W$ is the word size (cf. Section 2.6).

**Remark 3.3.** The prime $p$ is determined by the zk-SNARK construction with which the circuit generator is combined, and in our case is at least 160 bits (so that inverting discrete logarithms in related groups is hard). Thus, the condition $p > 2^{2W}$ is not really a restriction, even for large word sizes (e.g., $W = 64$). Regardless, Theorem 3.2 in fact holds for any field $\mathbb{F}$, but the construction, when $\text{char}(\mathbb{F}) \leq 2^{2W}$, is more complex, and our code does not currently support it.

### 3.1 Past techniques

Most of the difficulties that arise when designing a circuit generator have to do with data dependencies. A circuit’s topology does not depend on its inputs but, in contrast, program flow and memory accesses depend on the choice of program and the program’s inputs. Thus, a circuit tasked with verifying program executions must be “ready” to support a multitude of program flows and memory accesses, despite the fact that its topology has already been fixed. Various techniques have been applied to the design of circuit generators.

**Program analysis.** In the extreme, if both the program $P$ and its inputs $(\vec{x}, \vec{w})$ are known in advance, designing a circuit generator is simple: construct a circuit that evaluates $P$ on $(\vec{x}, \vec{w})$ by preparing the circuit’s topology to match the pre-determined program flow and memory accesses. But now suppose that only $P$ is known in advance, but not its inputs $(\vec{x}, \vec{w})$. In this case, by analyzing $P$ piece by piece (e.g., separately examine the various loops, branches, and so on), one could try to design a circuit $C_P$ that can handle different choices of inputs. Most prior circuit generators [SVPB$^+$12, SBVB$^+$13, PGHR13, BFRS$^+$13] take this approach.

However, this approach suffers from several limitations. First, the class of supported programs $P$ is not rich, because support for data dependencies is limited. E.g., [PGHR13] requires array accesses and loop iteration bounds to be compile-time constants; also, while [BFRS$^+$13] supports data-dependent memory accesses, most program flow is also restricted to be known (or bounded) at compile-time; mitigations are possible, but only in special cases [ZE13]. Second, and more importantly, this approach does not seem to allow for designing universal circuit generators, because the program $P$ is not known in advance and thus there is no program to analyze.
Multiplex every access. Computers are universal random-access machines (RAMs), so one approach of designing a universal circuit is to mimic a computer’s execution, building a layered circuit as follows. The $i$-th layer contains the entire state of the machine (CPU state and random-access memory) at time step $i$, and layer $i+1$ is computed from it by evaluating the transition function of the machine, handling any accesses to memory via multiplexing. While this approach supports arbitrary program flow, memory accesses are inefficiently supported; indeed, if memory has $S$ addresses, the resulting circuit is huge: it has size $\Omega(TS)$.

Nondeterministic routing. Ben-Sasson et al. [BCGT13a] suggested using nondeterministic routing on a Beneš network to support memory accesses efficiently; indeed, sorting and routing are ubiquitous techniques in fast simulation results between nondeterministic models of computation [Ofm65, Sch78, GS89, Rob91]. Our circuit generator builds on the techniques of [BCGT13a, BCGTV13a], so we briefly review the main idea behind nondeterministic routing.

Following [BCGT13a], Ben-Sasson et al. [BCGTV13a] introduced a simple computer architecture, called TinyRAM, and constructed a routing-based circuit generator for TinyRAM. They define the following notions. A CPU state, denoted $S$, is the CPU’s contents (e.g., program counter, registers, flags) at a given time step. An execution trace for a program $P$, time bound $T$, and primary input $x$ is a sequence $tr = (S_1, \ldots, S_T)$ of CPU states. An execution trace $tr$ is valid if there is an auxiliary input $\omega$ such that the execution trace induced by $P$ running on inputs $(x, \omega)$ is $tr$.

We seek an arithmetic circuit $C$ for verifying that $tr$ is valid. We break this down by splitting validity into three sub-properties: (i) validity of instruction fetch (for each time step, the correct instruction is fetched); (ii) validity of instruction execution (for each time step, the fetched instruction is correctly executed); and (iii) validity of memory accesses (each load from an address retrieves the value of the last store to that address).

The first two properties are verified as follows. Construct a circuit $C_P$ so that, for any two CPU states $S$ and $S'$, $C_P(S, S', g)$ is satisfied for some “guess” $g$ if and only if $S'$ can be reached from $S$ (by fetching from $P$ the instruction indicated by the program counter in $S$ and then executing it), for some state of memory. Then, properties (i) and (ii) hold if $C_P(S_i, S_{i+1}, \cdot)$ is satisfiable for $i = 1, \ldots, T-1$. Thus, $C$ contains $T-1$ copies of $C_P$, each wired to a pair of adjacent states in $tr$.

The third property is verified via nondeterministic routing. Assume that $C$ also gets as input $\text{MemSort}(tr)$, which equals to the sorting of accessed memory addresses (breaking ties via timestamps), and write a circuit $C_{\text{mem}}$ so that validity of memory accesses holds if $C_{\text{mem}}$ is satisfied by each pair of adjacent states in $\text{MemSort}(tr)$. (Roughly, $C_{\text{mem}}$ checks consistency of “load-after-load”, “load-after-store”, and so on.) However, $C$ merely gets some auxiliary input $tr^*$, which purports to be $\text{MemSort}(tr)$. So $C$ works as follows: (a) $C$ has $T-1$ copies of $C_{\text{mem}}$, each wired to a pair of adjacent states in $tr^*$; (b) $C$ separately verifies that $tr^* = \text{MemSort}(tr)$ by routing on a $O(T \log T)$-node Beneš network. The switches of the routing network are set according to non-deterministic guesses (i.e., additional values in the auxiliary input), and the routing network merely verifies that the switch settings induce a permutation; this allows for a very tight reduction. (Known constructions that compute the correct permutation hide large constants in big-oh notation [AKS83].)

Past inefficiencies. After filling in additional details, the construction of [BCGTV13a] reviewed above gives a circuit of size $\Theta((n + T) \cdot (\log(n + T) + \ell)) = \Omega(\ell \cdot T)$. The $\Omega(\ell \cdot T)$ arises from the fact that all of the $\ell$ instructions in $P$ are hardcoded into each of the $T-1$ copies of $C_P$. Thus, besides being non-universal, the circuit scales inefficiently as $\ell$ grows (e.g., for $\ell = 10^4$, $C_P$’s size is already dominated by $P$’s size).

3.2 Our construction

In comparison to [BCGTV13a], our circuit generator is universal and, moreover, its size only grows with $\ell + T$ (additive dependence on program size) instead of with $\ell \cdot T$ (multiplicative dependence). As our evaluation demonstrates (see Section 5.1), the size improvement actually translates into significant savings in practice.
Instead of hardcoding the program $P$ into each copy of the circuit $C_P$, we follow the von Neumann paradigm, where the program $P$ lies in the same read/write memory space as data. We ensure that $P$ is loaded into the initial state of memory, using a dedicated circuit; we then verify instruction fetch via the same routing network that is used for checking data loads/stores. While the idea is intuitive, realizing it involves numerous technical difficulties, some of which are described below.

**Routing instructions and data.** We extend an execution trace to not only include CPU states but also instructions: $tr = (S_1, I_1, \ldots, S_T, I_T)$ where $S_i$ is the $i$-th CPU state, and $I_i$ is the $i$-th executed instruction. We seek an arithmetic circuit $C$ that checks $tr$, in this “extended” format, for the same three properties as above: (i) validity of instruction fetch; (ii) validity of instruction execution; (iii) validity of memory accesses.

As in [BCGTV13a], checking that $tr$ satisfies property (ii) is quite straightforward. Construct a circuit $C_{exe}$ so that, given two CPU states $S, S'$ and an instruction $I$, $C_{exe}(S, S', I, g)$ is satisfied, for some guess $g$, if and only if $S'$ can be reached from $S$, by executing $I$, for some state of memory. Then, $C$ contains $T - 1$ copies of $C_{exe}$, each wired to adjacent CPU states and an instruction, i.e., the $i$-th copy is $C_{exe}(S_i, S_{i+1}, I_i, g_i)$.

Unlike [BCGTV13a], though, we verify properties (i) and (iii) jointly, via the same routing network. The auxiliary input now contains $tr^* = (A_1, \ldots, A_{2T})$, purportedly equal to the memory-sorted list of both instructions fetches and CPU states. (Since the program $P$ lies in the same read-write memory as data, an instruction fetch from $P$ is merely a special type of memory load.) Thus, to check that $tr$ satisfies properties (i) and (iii), we design $C$ to (a) verify that $tr^* = \text{MemSort}(tr)$ via nondeterministic routing, and (b) verify validity of all (i.e., instruction and data) memory accesses, via a new circuit $C_{mem}^*$ applied to each pair of adjacent items $A_i, A_{i+1}$ in $tr^*$. Thus, in this approach, $P$ is never replicated $T$ times; rather, the fetching of its instructions is verified together with all other memory accesses, one instruction fetch at a time.

**Multiple memory-access types.** Each copy of $C_{mem}^*$ inspects a pair of items in $tr^*$ and (assuming $tr^* = \text{MemSort}(tr)$) must ensure correctness of “load-after-load”, “load-after-store”, and so on. However, unlike in [BCGTV13a], the byte-addressable memory of vnTinyRAM is accessed in different-sized blocks: instruction-size blocks for instruction fetch; word-size blocks when loading/storing words; and byte-size blocks when loading/storing bytes. The consistency checks in $C_{mem}^*$ must handle “aliasing”, i.e., accesses to the same point in memory via different addresses and block sizes.

We tackle this difficulty as follows. Double-word blocks are the largest blocks in which memory is accessed (as instructions are encoded as double words; cf. Section 2.6). We thus let each item in $tr^*$ always specify a double-word, even if the item’s memory access was with respect to a smaller-sized block (e.g., word or byte). With this modification, we can let $C_{mem}^*$ perform consistency checks “at the double-word level”, and handling word/byte accesses by mapping them to double-word accesses with suitable shifting and masking.

**Booting the machine.** We have so far assumed that the program $P$, given as input to $C$, already appears in memory. However, the circuit $C$ is designed so far only verifies the validity of $tr$ with respect to a machine whose memory is initialized to some state, corresponding to the execution of some program. But $C$ must verify correct execution of, specifically, $P$, and so it must also verify that memory is initialized to contain $P$. Since $C$ does not explicitly maintain memory (not even the initial one) and only implicitly reasons about memory via the routing network, it is not clear how $C$ can perform this check.

We tackle this difficulty as follows. We further modify the the execution trace $tr$, by extending it with an initial boot section, preceding the beginning of the computation, during which the input program $P$ is stored into memory, one instruction $P_i$ at a time. This extends the length of both $tr$ and $tr^*$ from $2T$ to $\ell + 2T$, for $\ell$-instruction programs, and introduces a new type of item, “boot input store”, in $tr^*$. Similarly, the routing network is now responsible for routing $\ell + 2T$, rather than $2T$, packets.

**Further optimizations.** The above construction sketch (depicted in Figure 6) is only intuitive, and does not discuss other optimizations that ultimately yield the performance that we report in Section 5.1.

For example, while [BCGTV13a] rely on Beneš networks, we rely on arbitrary-size Waksman networks [BD02], which only require $N(\log N - 0.91)$ switches to route $N$ packets, instead of $2^{\log^N N}((\log N) - 0.5)$.
Besides being closer to the information-theoretic lower bound of $N(\log N - 1.443)$, such networks eliminate costly rounding effects in [BCGTV13a], where the size of the network is doubled if $N$ is just above a power of 2 (since the height of a Benes network is $2^{\lceil \log N \rceil}$).

As another example, we want $C$ to not only support programs with exactly $\ell$ instructions but also with at most $\ell$, and similarly for the bound $n$ on the size of primary inputs (which our discussion has so far omitted); we work out the details for $C$ to efficiently support such upper bounds.

Compiling to vnTinyRAM. To enable verification of higher-level programs, written in C, we ported the GCC compiler to the vnTinyRAM architecture, by modifying the Harvard-architecture, word-addressible TinyRAM C compiler of [BCGTV13a]. Given a C program, written in the same subset of C as in [BCGTV13a], the compiler produces the initial memory map representing a program $P$. This also served to validate the vnTinyRAM architectural choices (e.g., the move to byte-addressing significantly, and added instructions, improved efficiency for many programs).

![Outline of our universal circuit construction with the extended trace $tr$ on the left and (allegedly) its memory sort $tr^*$ on the right. All inputs to the circuit, with the exception of $P$ (and the primary input $x$, not shown), are nondeterministic guesses.](image)

Figure 6: Outline of our universal circuit construction with the extended trace $tr$ on the left and (allegedly) its memory sort $tr^*$ on the right. All inputs to the circuit, with the exception of $P$ (and the primary input $x$, not shown), are nondeterministic guesses.

### 4 Our zk-SNARK for circuits

We discuss our second main contribution: a high-performance implementation of a zk-SNARK for arithmetic circuit satisfiability. Our approach is to tailor the requisite mathematical algorithms to the specific zk-SNARK protocol at hand. While our techniques can be instantiated in many algebraic setups and security levels, we demonstrate them in two specific settings, to facilitate comparison with prior work. Later, in Section 5.2, we provide benchmarks for our zk-SNARK.

#### 4.1 The PGHR protocol and the two elliptic curves

See Section 2.5 for an informal definition of a zk-SNARK for arithmetic circuit satisfiability. We improve upon and implement the zk-SNARK of Parno et al. [PGHR13]. For completeness the “PGHR protocol” is summarized in Figure 10, which provides pseudocode for its key generator $G$, prover $P$, and verifier $V$. The construction is based on QAPs, introduced in Section 2.3.

Like most other zk-SNARKs, the PGHR protocol relies on a pairing, which is specified by a prime $r \in \mathbb{N}$, three cyclic groups $G_1, G_2, G_T$ of order $r$, and a bilinear map $e: G_1 \times G_2 \rightarrow G_T$. (See Section 2.4.)
A pairing is typically instantiated via a pairing-friendly elliptic curve. Concretely, suppose that one uses a curve $E$ defined over $\mathbb{F}_q$, with embedding degree $k$ with respect to $r$, to instantiate the pairing. Then $\mathbb{G}_T$ is set to $\mu_r$, the subgroup of $r$-th roots of unity in $\mathbb{F}_{q^k}^\times$. The instantiation of $\mathbb{G}_1$ and $\mathbb{G}_2$ depends on the choice of $e$; typically, $\mathbb{G}_1$ is instantiated as an order-$r$ subgroup of $E(\mathbb{F}_q)$, while, for efficiency reasons [BKL02, BLS04], $\mathbb{G}_2$ as an order-$r$ subgroup of $E'(\mathbb{F}_{q^d})$ where $E'$ is a $d$-th twist of $E$. Finally, the pairing $e$ is typically a two-stage function $e(P, Q) := \text{FE}(\text{ML}(P, Q))$, where $\text{ML} : \mathbb{G}_1 \times \mathbb{G}_2 \to \mathbb{F}_q^\times$ is known as Miller loop, and $\text{FE} : \mathbb{F}_q^\times \to \mathbb{F}^\times_q$ is known as final exponentiation and maps $\alpha$ to $\alpha^{(q-1)/r}$.

As mentioned, we instantiate our techniques based on two different curves: an Edwards curve for the 80-bit security level (as in [BCGTV13a]) and a Barreto–Naehrig curve for the 128-bits security level (as in [PGHR13, BFRS+13]). We selected both the Edwards curve and Barreto–Naehrig curve so that $r − 1$ has high 2-adic order (i.e., $r − 1$ is divisible by a large power of 2), because this was shown to improve the efficiency of the key generator and the prover [BCGTV13a].

Next, we describe the optimizations that we have applied to the zk-SNARK verifier (Section 4.2), then to the prover (Section 4.3), and, finally, to the key generator (Section 4.4).

4.2 An optimized verifier

The verifier $V$ takes as input a verification key $\mathbf{vk}$, input $\mathbf{x} \in \mathbb{F}_r^m$, and proof $\pi$, and checks if $\pi$ is a valid proof for the statement “$\mathbf{x} \in \mathcal{L}_C$”. The computation of $V$ consists of two parts. First, use $\mathbf{vk}_{i,0}, \ldots, \mathbf{vk}_{i,n} \in \mathbb{G}_1$ (part of $\mathbf{vk}$) and input $\mathbf{x}$ to compute $\mathbf{vk}_{i,x} := \mathbf{vk}_{i,0} \cdot \sum_{i=1}^{n} x_i \mathbf{vk}_{i,i}$ (see Step 1 in Figure 10c). Second, use $\mathbf{vk}$, $\mathbf{vk}_{i,x}$, and $\pi$, to compute 12 pairings and perform the required checks (see Step 2, Step 3, Step 4 in Figure 10c). In other words, $V$ performs $O(n)$ scalar multiplications in $\mathbb{G}_1$, followed by $O(1)$ pairing evaluations.

With regard to $V$’s first part, variable-base multi-scalar multiplication techniques can be used to reduce the number of $\mathbb{G}_1$ operations needed to compute $\mathbf{vk}_{i,x}$ [BCGTV13a, PGHR13]. With regard to $V$’s second part, even if the pairing evaluations take constant time (independent of the input size $n$), these evaluations are very expensive and dominate for small $n$. Our focus here is to minimize the cost of these pairing evaluations.

When only making “black-box” use of a pairing, the verifier must evaluate 12 pairings (see Figure 10c), amounting to 12 Miller loops plus 12 final exponentiations. The straightforward approach is to compute these using a generic high-performance pairing library. We proceed differently: we obtain high-performance implementations of sub-components of a pairing, and then tailor their use specifically to $V$’s protocol.

Namely, first, we obtain state-of-the-art implementations of a Miller loop and final exponentiation. We utilize optimal pairings [Ver10] to minimize the number of loop iterations in each Miller loop, and, to efficiently evaluate each Miller loop, rely on the formulas of [ALNR11] (for Edwards curves) and [BGDMO+10] (for BN curves). As for final exponentiation, we use multiple techniques to speed it up: [SBCDPK09, GS10, FCnKRH12, KKC13].

Next, building on the above foundation, we incorporate in $V$ the following optimizations.

1) Sharing Miller loops and final exponentiations. The verifier $V$ computes two products of two pairings (see Step 3 and Step 4 in Figure 10c). We leverage the fact that a product of pairings can be evaluated faster than evaluating each pairing separately and then multiplying the results [Sol03, Sco05, GS06, Sco07]. Concretely, in a product of $m$ pairings, the Miller loop iterations for evaluating each factor can be carried out in “lock-step” so to share a single Miller accumulator variable, using one $\mathbb{F}_{q^d}$ squaring per loop instead of $m$.

In a similar vein, one can perform a single final exponentiation on the product of the outputs of the $m$ Miller loops, instead of $m$ final exponentiations and then multiplying the results. In fact, since the output of the pairing can be inverted for free (as the element is unitary so that inverting equals conjugating [SB04]), the idea of “sharing” final exponentiations extends to a ratio of pairing products. Thus, in the verifier we only need to perform 5, instead of 12, final exponentiations.

Our implementation incorporates both of the above techniques. For example, at the 80-bit security level, separately computing 12 optimal pairings costs 13.6 ms, but the above techniques reduce the time to only
8.1 ms. We decrease this further as discussed next.

(2) Precomputation by processing the verification key. Of the 12 pairings the verifier needs to evaluate, only one is such that both of its inputs come from the proof \( \pi \). The other 11 pairings have one fixed input, either a generator of \( G_1 \) or \( G_2 \), or coming from the verification key \( v_k \).

Whenever one of the two inputs to a pairing is fixed, precomputation techniques apply [GHS02, BLS03, Sco07], especially in the case when the fixed input is the base point in Miller’s algorithm. In \( V \), this holds for 9 out of the 11 pairing evaluations. We thus split the verifier’s computation into an offline phase, which consists of a one-time precomputation that only depends on \( v_k \), and a many-time online phase, which depends on the precomputed values, input \( \vec{x} \), and proof \( \pi \). More precisely, the result of the offline phase is a processed verification key \( v_k^* \). While \( v_k^* \) is longer than \( v_k \), it allows the online phase to be faster.

E.g., at the 80-bit security level, \( v_k^* \) decreases the total cost of pairing checks from 8.1 ms to 4.7 ms.

4.3 An optimized prover

The prover \( P \) takes as input a proving key \( pk \) (which includes the circuit \( C : \mathbb{F}_r^n \times \mathbb{F}_r^h \rightarrow \mathbb{F}_r^l \)), input \( \vec{x} \in \mathbb{F}_r^n \), and witness \( \vec{a} \in \mathbb{F}_r^h \). The prover \( P \) is tasked to produce a proof \( \pi \), attesting that \( \vec{x} \in \mathcal{C} \).

The computation of \( P \) consists of two main parts. First, compute the coefficients \( \vec{h} \) of the polynomial \( H(z) := \frac{A(z)B(z) - C(z)}{Z(z)} \) (see Step 4 in Figure 10b), where \( A, B, C \in \mathbb{F}_r[z] \) are derived from the QAP instance \((\vec{A}, \vec{B}, \vec{C}, Z) := \text{QAPinst}(C) \) and QAP witness \( \vec{s} := \text{QAPwit}(C, \vec{x}, \vec{a}) \). Second, use the coefficients \( \vec{h} \), QAP witness \( \vec{s} \), and public key \( pk \) to compute \( \pi \) (see Step 6 in Figure 10b).

With regard to the first part of \( P \), the coefficients \( \vec{h} \) can be efficiently computed via FFT techniques [BCGT13a, PGHR13]; our implementation follows [BCGT13a], and leverages the high 2-adic order of \( r - 1 \) for both of the elliptic curves we use.\(^6\) With regard to \( P \)’s second part, computing \( \pi \) requires solving large instances of the following problem: given elements \( Q_1, \ldots, Q_n \) all in \( G_1 \) (or all in \( G_2 \)) and scalars \( \alpha_1, \ldots, \alpha_n \in \mathbb{F}_r \), compute \( \langle \vec{\alpha}, \vec{Q} \rangle := \alpha_1 Q_1 + \cdots + \alpha_n Q_n \). Previous work [PGHR13, BCGT13a] has leveraged generic multi-scalar multiplication to compute \( \pi \). We observe that these algorithms can be tailored to the specific scalar distributions arising in \( P \). In \( P \), the vector \( \vec{\alpha} \) is one of two types: (i) \( \vec{\alpha} \in \mathbb{F}_r^{d+1} \) and represents the coefficients of the degree-\( d \) polynomial \( H \); or (ii) \( \vec{\alpha} = (1 \circ \vec{s} \circ \delta_1 \circ \delta_2 \circ \delta_3) \in \mathbb{F}_r^{4 + m} \), for random \( \delta_1, \delta_2, \delta_3 \in \mathbb{F}_r^\times \).

In case i, the entries in of \( \vec{\alpha} \) are random-looking. We use the Bos–Coster algorithm [BC89] due to its lesser memory requirements (as compared to, e.g., [Pip80]). We follow [BDLSY11]’s suggestions and achieve an assembly-optimized heap to implement the Bos–Coster algorithm.

In case ii, the entries in \( \vec{s} \) depend on the input \((C, \vec{x}, \vec{a})\) to QAPwit; in turn, \((C, \vec{x}, \vec{a}) \) depends on our circuit generator (Section 3). Using the above algorithm “as is” is inefficient: the algorithm works well when all the scalars have roughly the same bit complexity, but the entries in \( \vec{\alpha} \) have very different bit complexity. Indeed, \( \vec{\alpha} \) equals to \( \vec{s} \) augmented with a few entries; and \( \vec{s} \), the QAP witness, can be thought of as the list of wire values in \( C \) when computing on \((\vec{x}, \vec{a})\); the bit complexity of a wire value depends on whether it is storing a boolean value, a word value, and so on. We observe that there are only a few “types” of values, so that the entries of \( \vec{\alpha} \) can be clustered into few groups of scalars with approximately the same bit complexity; we then apply the algorithm of [BC89] to each such group.

4.4 An optimized key generator

The key generator \( G \) takes as input a circuit \( C : \mathbb{F}_r^n \times \mathbb{F}_r^h \rightarrow \mathbb{F}_r^l \), and is tasked to compute a proving key \( pk \) and a verification key \( vk \). The computation of \( G \) consists of two main parts. First, evaluate each \( A_i, B_i, C_i \) at

\(^6\)If the 2-adic order of \( r - 1 \) is \( i \) then \( \mathbb{F}_r \) contains a primitive root of unity of order \( 2^i \). Hence, one can use the classical radix-2 multiplicative FFT [CT65] and its inverse over domains of size \( 2^i \). These algorithms only require \( O(n \log n) \) field operations for degree-\( n \) polynomials, and are particularly efficient in practice.
a random element $r$, where $(\vec{A}, \vec{B}, \vec{C}, Z) := \text{QAPinst}(C)$ is the QAP instance. Second, use these evaluations to compute $pk$ and $vk$ (see Step 3 and Step 4 in Figure 10a).

With regard to $G$’s first part, we follow [BCGTV13a] and again leverage the fact that $\mathbb{F}_r$ has a primitive root of unity of large order. With regard to $G$’s second part, it is dominated by the cost of computing $pk$, which requires solving large instances of the following problem: given an element $\mathcal{P}$ in $\mathbb{G}_1$ or $\mathbb{G}_2$ and scalars $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_r$, compute $\alpha_1\mathcal{P}, \ldots, \alpha_n\mathcal{P}$. Previous work [PGHR13, BCGTV13a], used fixed-base windowing [BGMW93] to efficiently compute such fixed-base multi-scalar multiplications.

In our implementation, we achieve additional efficiency, in space rather than in time. Specifically, we leverage a structural property of QAPs derived from arithmetic circuits, in order to reduce the size of the proving key $pk$, as we now explain. Lemma 2.4 states that an $\mathbb{F}$-arithmetic circuit $C : \mathbb{F}^m \times \mathbb{F}^h \to \mathbb{F}^l$, with $\alpha$ wires and $\beta$ gates, can be converted into a corresponding QAP of size $m = \alpha$ and degree $d \approx \beta$ over $\mathbb{F}$. Roughly, this is achieved in two steps. First, construct three matrices $A, B, C \in \mathbb{F}^{(m+1) \times d}$ that encode $C$’s topology: for each $j \in [d]$, the $j$-th column of $A, B$ respectively encodes the “left” and “right” coefficients of the $j$-th bilinear gate in $C$, while the $j$-th column of $C$ encodes the coefficients of the gate’s output. Second, letting $S \subset \mathbb{F}$ be a set of size $d$, define $Z(z) := \prod_{\omega \in S}(z - \omega)$ and, for $i \in \{0, \ldots, m\}$, let $A_i$ be the low-degree extension of the $i$-th row of $A$; similarly define each $B_i$ and $C_i$. All prior QAP-based zk-SNARK implementations exploit the fact that columns in the matrices $A, B, C$ are very sparse.

In contrast, we also leverage a different kind of sparsity: we observe that it is common for entire rows of $A, B, C$ to be all zeroes, causing the corresponding low-degree extensions to be zero polynomials. For instance, our circuit generator typically outputs a circuit for which the percentage of non-zero polynomials in $\vec{A}, \vec{B}, \vec{C}$ is only about 52%, 15%, 71% respectively. The fact that many polynomials in $\vec{A}, \vec{B}, \vec{C}$ evaluate to zero can be used towards reducing the size of $pk$, by switching from a dense representation to a sparse one.

In fact, we have engineered our circuit generator to reduce the number of non-zero polynomials in $\vec{B}$ as much as possible, because computations associated to evaluations of $\vec{B}$ are conducted with respect to more expensive $\mathbb{G}_2$ arithmetic, which we want to avoid as much as possible.

5 Evaluation

We evaluated our system on a desktop computer with a 3.40 GHz Intel Core i7-4770 CPU (with Turbo Boost disabled) and 32 GB of RAM. All experiments, except the largest listed in Figure 8 and Figure 9, used a small fraction of the RAM. For the two largest experiments in Figure 9, we added a Crucial M4 solid state disk for swap space. Finally, while our code supports multi-threading, we ran all of our experiments in single-thread mode, for ease of comparison with prior work.

5.1 Performance of our circuit generator

In Section 3, we described our universal circuit generator; we now benchmark its performance.

Parameter choices. The circuit generator supports the architecture vnTinyRAM, which is parametrized by two quantities: the word size $W$ and the number of registers $K$ (see Section 2.6). We report performance for a machine with $K = 16$ registers, and two choices of word size: $W = 16$ and $W = 32$. Also, a circuit generator is defined relative to a prime field $\mathbb{F}_p$ (see Definition 3.1) and its efficiency may in principle depend on $p$; since our construction has the same number of gates for any $p$ with $p > 2^{2W}$ (a condition that holds for any cryptographically-large $p$), in the discussion below we do not have to worry about the value of $p$.

---

7E.g., if the $i$-th wire never appears with a non-zero coefficient as the “left” input of a bilinear gate, then the $i$-th row of $A$ is zero, and thus $A_i$ is the zero polynomial.

8Moreover, 15% non-zero polynomials in $\vec{B}$ is likely not optimal: one can verify that minimizing the number of non-zero polynomials in $\vec{B}$ reduces to a minimum vertex cover problem [MR96]. It is an interesting open question whether approximation algorithms for such a problem can be used to further improve efficiency, and go below 15%.
Methodology. Theorem 3.2 provides an asymptotic efficiency guarantee: it states that our circuit generator has efficiency \( f(\ell, n, T) = \Theta((\ell + n + T) \cdot \log(\ell + n + T)) \). To understand concrete efficiency, we “uncover” the constants hidden in the big-oh notation. By studying the number of gates in various subcircuits of the generated circuit \( C := \text{circ}(\ell, n, T) \), we computed the following (quite tight) upper bound on \( C \)'s size:

\[
(12 + 2W) \cdot \ell + (12 + W) \cdot n + |C_{\text{exe}}| \cdot T + (|C_{\text{mem}}| + 4 \log H - 1.82) \cdot H
\]

where \( H := (\ell + n + 2T) \) is the “height” of the routing network, and

- for \((W, K) = (16, 16)\): \(|C_{\text{exe}}| = 777\) and \(|C_{\text{mem}}| = 211\); and
- for \((W, K) = (32, 16)\): \(|C_{\text{exe}}| = 1114\) and \(|C_{\text{mem}}| = 355\).

In Figure 7, we give per-cycle gate counts (i.e., \( |C|/T \)) for various choices of \((\ell, n, T)\); we also give sub-counts divided among program/input boot, CPU execution, memory checking, and routing. (See Figure 11 in Appendix C for an extended table with additional data.)

Discussion. We first go through the size expression, to understand it: The first two terms, \((12 + 2W) \cdot \ell + (12 + W) \cdot n\), correspond to the pre-execution boot phase, during which an \( \ell \)-instruction program and an \( n \)-word primary input are loaded into the machine. The term \(|C_{\text{exe}}| \cdot T\) corresponds to the \( T \) copies of \( C_{\text{exe}} \) used to verify each CPU transition, given the fetched instruction and two CPU states. The term \(|C_{\text{mem}}| \cdot H\) corresponds to the \( H \) copies of \( C_{\text{mem}} \) used to verify consistency on the memory-sorted trace. Finally, the term \((4 \log H - 1.82) \cdot H\) corresponds to the routing network for routing \( H \) packets (two gates for each of \(2 \log H - 0.91\) binary switches). Note that \( H = (\ell + n + 2T) \) because boot needs \( \ell + n \) memory stores (one for each program instruction and primary input word) and execution needs \( 2T \) memory accesses (1 instruction fetch and 1 data store/load per execution cycle).

The gate counts in Figure 7 demonstrate the additive (instead of multiplicative) dependence on program size of our universal circuit pays off. For example, for \((W, K) = (32, 16)\), a 100-fold increase in program size, from \( \ell = 10^3 \) to \( \ell = 10^5 \), barely impacts the per-cycle gate count: for \( T = 2^{20} \), it increases from 1,992.5 to only 2,041.5. Indeed, the cost of program size is incurred, once and for all, during the machine boot; Figure 7 shows that the per-cycle cost of machine boot diminishes as \( T \) grows.

Second, less than half of \( C \)'s gates are dedicated to verifying accesses to random-access memory, while the majority of gates are dedicated to verifying execution of the CPU; indeed, almost always, \(|C_{\text{exe}}|T > \frac{1}{2}|C|\). Put otherwise, \( C \), which verifies an automaton with random-access memory (\( \text{vntinyRAM} \)), has size that is less than twice that for verifying an automaton with the same CPU but no random-access memory at all. Moreover, note that the size of \( C_{\text{exe}} \) appears quite tight: for example, with \((W, K) = (32, 16)\), it has size 1114, not much larger than the size of the CPU state (545 bits).

|\( W = 16 \) \ | \( W = 32 \) |
|---|---|---|---|
|\( |C|/T \)\( \text{divided among} \) | \( |C|/T \)\( \text{divided among} \) | \( \text{Per} \) | \( |C|/T \)\( \text{divided among} \) | \( \text{Per} \) |
|\( n = 10^2, K = 16 \) | \( n = 10^2, K = 32 \) |
|\( T = 2^{20} \) | 1,370.3 | 0.41 | 777.0 | 424.0 | 168.0 | 1,997.0 | 0.72 | 1,114.0 | 713.4 | 168.8 |
|\( T = 2^{24} \) | 1,399.2 | 0.03 | 777.0 | 422.1 | 200.1 | 2,024.3 | 0.05 | 1,114.0 | 710.2 | 200.1 |
|\( T = 2^{28} \) | 1,431.0 | 0.00 | 777.0 | 422.0 | 232.0 | 2,056.0 | 0.00 | 1,114.0 | 710.0 | 232.0 |

Figure 7: Performance of our circuit generator: per-cycle gate counts in \( C := \text{circ}(\ell, n, T) \) for different choices of \((\ell, n, T)\) and \( \text{vntinyRAM} \) parameters \((W, K)\).
5.2 Performance of our zk-SNARK for circuit satisfiability

In Section 4 we described our zk-SNARK implementation; we now benchmark its performance.

**Methodology.** We provide performance characteristics for each of the zk-SNARK algorithms, \( G, P \) and \( V \), at the 80-bit and 128-bit security levels. We benchmark the system as follows.

(1) The key generator \( G \) takes as input an arithmetic circuit \( C: \mathbb{F}_r^n \times \mathbb{F}_r^h \rightarrow \mathbb{F}_r^f \). Its efficiency mostly depends on the number of gates and wires in \( C \), because these affect the size and degree of the corresponding QAP (see Lemma 2.4). Thus, we evaluate \( G \) on a circuit with \( 2^i \) gates and \( 2^i \) wires for \( i \in \{10,12,\ldots,24\} \) (and fixed \( n = h = l = 100 \)). In Figure 8 we report the resulting running times and key sizes, as per-gate costs.

(2) The prover \( P \) takes as input a proving key \( pk \), input \( \vec{x} \in \mathbb{F}_r^n \), and witness \( \vec{a} \in \mathbb{F}_r^h \). Its efficiency mostly depends on the number of gates and wires in \( C \) (the circuit used to generate \( pk \)); we thus evaluate \( P \) on the proving keys output by \( G \), for the same circuits as above. In Figure 8 we report the resulting running times, as per-gate costs, and (constant) proof sizes.

(3) The verifier \( V \) takes as input a verification key \( vk \), input \( \vec{x} \in \mathbb{F}_r^n \), and proof \( \pi \). Its efficiency depends only on \( \vec{x} \) (since the size of \( \vec{x} \) determines that of \( vk \)). Thus, we evaluate \( V \) on a random input \( \vec{x} \in \mathbb{F}_r^n \) of \( 2^i \) bytes for \( i \in \{2,4,\ldots,20\} \). In Figure 8 we report the resulting running times, along with corresponding key sizes.

For completeness, Figure 12 in Appendix C reports the unnormalized measurements and additional information (e.g., times for various subcomputations).

**Discussion.** The data demonstrates that our zk-SNARK implementation works and scales as expected, as long as sufficient memory is available (e.g., on a desktop computer with 32GB of DRAM: up to 16 million gates); also, as expected, the higher security level entails higher costs. Key generation takes about 10 ms per gate of \( C \); the size of a proving key is about 300 B per gate, and the size of a verification key is about 1 B per byte of input to \( C \). Running the prover takes 11 ms to 14 ms per gate. For an \( n \)-byte input, proof verification time is \( c_1 n + c_0 \), where \( c_0 \) is a few milliseconds and \( c_1 \) is a few tenths of microseconds.

**Remark 5.1.** Another factor affecting the efficiency of \( G \) and \( P \) is the number of non-zero polynomials in the QAP instance obtained from the circuit \( C \) (see Section 4.4). In Figure 8 we reported worst-case numbers in this respect: we only used circuits whose QAP has no non-zero polynomials. In general, QAP with more zero polynomials make the key generator and prover faster; in particular, the circuits output by our circuit generator induce QAP instances with many zero polynomials, so that the numbers reported in Section 5.3 are somewhat better than what one would expect by merely multiplying the per-gate costs of Figure 8 with the number of gates in the circuit output by the circuit generator.

5.3 Performance of the combined system

As discussed, our circuit generator (Section 3) and zk-SNARK for circuits (Section 4) can be used independently, or combined to obtain a zk-SNARK for \( \text{vnTinyRAM} \). For completeness, in Appendix D.2 we spell out how these two components can be combined. Here we report measured performance of this combined system, at the 128-bit security level, and for a word size \( W = 32 \) and number of registers \( K = 16 \).

**Methodology.** A zk-SNARK for \( \text{vnTinyRAM} \) is a triple of algorithms \( (\text{KeyGen}, \text{Prove}, \text{Verify}) \). Given bounds \( \ell, n, T \) (for program size, input size, and time), the efficiency of KeyGen and Prove depends on \( \ell, n, T \), while that of Verify essentially depends only on \( \ell, n \). Thus, we benchmark the system as follows.

We evaluate KeyGen and Prove for various choices of \( \ell \) and \( T \), while keeping \( n = 100 \). (Varying \( \ell \) or \( n \) affects efficiency in similar ways, so we fix \( n \) and vary \( \ell \).) Instead, since the efficiency of Verify does not depend on \( T \), we evaluate Verify, for various choices of \( \ell \) and \( n \), on random \( \ell \)-instruction programs and \( n \)-word inputs. In Figure 9, we report the following measurements: KeyGen’s running time, the sizes of the keys \( pk \) and \( vk \), Prove’s runtime, the (constant) proof size, and Verify’s running time. For quantities growing with \( T \), we divide by \( T \) and report the per-cycle cost.
Figure 8: Performance of our zk-SNARK for arithmetic circuit satisfiability: per-gate costs of the key generator and prover for various circuit sizes; and per-byte costs of the verifier for various input sizes. \((N = 10 \text{ and std } < 1\%)\)

For completeness, Figure 13 in Appendix C reports the unnormalized measurements and additional information, such as times for various subcomputations (e.g., subtimes for the circuit generator and zk-SNARK).

**Discussion.** The measurements demonstrate that, on a desktop computer, our zk-SNARK for \texttt{vnTinyRAM} scales up to computations of 32,000 machine cycles, for programs with up to 10,000 instructions. Key generation takes about 200 ms per cycle; the size of a proving key is 500 KB to 650 KB per cycle, and the size of a verification key is a few kilobytes in total. Running the prover takes 100 ms to 200 ms per cycle. Verification times remain a few milliseconds, even for inputs and programs of several kilobytes.

**Program-specific \(v_k\).** The time complexity of Verify is \(O(\ell + n)\), so verification time grows with program size. This is inevitable, because Verify must read a program \(P\) (of at most \(\ell\) instructions) and input \(z\) (of at most \(n\) words) in order to check, via the given proof \(\pi\), if \((P, z) \in \mathcal{L}_{\ell,n,T}\) (cf. Definition 2.7). However, this is inconvenient, e.g., when one has to verify many proofs relative to different inputs to the same program \(P\).

In our zk-SNARK it is possible to amortize this cost as follows. Given \(v_k\) and \(P\), one can derive, in time \(O(\ell)\), a program-specific verification key \(v_k_P\), which can be used to verify proofs relative to any input to \(P\). Subsequently, the time complexity of Verify for any input \(z\) (to \(P\)) is only \(O(n)\), independent of \(\ell\). Essentially, one can pre-compute the program-specific part of \(v_k_{\vec{z}}\) (see Step 1 in Figure 10.), so that, later, one only needs to compute the input-dependent part of \(v_k_{\vec{z}}\) and combine it with \(v_k_P\). (Conversely, it is also possible to derive an input-specific verification key, for verifying proofs relative to the same input to different programs.)

Figure 13 in Appendix C also reports the subtime to compute \(v_k_P\), which represents the time saved when
one first precomputes \( \text{vk}_P \) ahead of time.

### 5.4 Comparison with prior work

#### 5.4.1 Comparison with prior circuit generators

Universality is the main innovative feature of our circuit generator. No previous circuit generator achieves universality. (See Figure 1 and Section 3.)

Putting universality aside and focusing on efficiency instead, a comparison with previous circuit generators is a multi-faceted problem. On one hand, due to a shared core of techniques, a comparison with [BCGTV13a]'s circuit generator is straightforward, and shows significant improvements in circuit size, especially as program size grows. See Section 1.4.1 and Figure 2 (the figure’s numbers are for \( W, K = 16 \)).

Instead, a comparison with other circuit generators [SVPB+12, SBVB+13, PGHR13, BFRS+13] is complex. First, they support a smaller class of programs (see Figure 1), so a programmer must “write around” the limited functionality, somehow. And second, their efficiency is not easily specified: due to the use of program-analysis techniques (see Section 3.1) the output circuit is ad hoc for the given program, and the only way to know its size is to actually run the circuit generator.

Compared to [SVPB+12, SBVB+13, PGHR13, BFRS+13], our circuit generator performs better for programs that are rich in memory accesses and control flow, and worse for programs that are more “circuit like”.

**Comparison with [SVPB+12, SBVB+13, PGHR13].** The circuit generators in [SVPB+12, SBVB+13, PGHR13] restrict loop iteration bounds and memory accesses to be known at compile time; if a program does not respect these restrictions, it must be first somehow mapped to another one that does. For simplicity, we take [PGHR13]’s circuit generator (the latest one) as representative and, to illustrate the differences between [PGHR13]’s and our circuit generator, we consider two “extremes”.

On one extreme, we wrote a simple C program multiplying two 10 × 10 matrices of 16-bit integers. The circuit generator in [PGHR13] produces a circuit with 1100 gates\(^9\); instead, our circuit generator (when given

\(^9\)The circuit produced by [PGHR13] for int values, with “--bit-width= 16”, nonetheless performs arithmetic modulo some large prime, without reductions modulo \( 2^{16} \).
We have presented two main contributions: (i) a circuit generator for a von Neumann RISC architecture that finds widespread use in security applications, more work is required to slash costs of key generation and proving so that memory has the conceptual advance of universality.

Addressing the other component of our system, the zk-SNARK for circuits: Figure 3 compares our implementation with prior ones, on a 32,000 cycle, for programs with up to 1,000 instructions, relative to a simple universal circuit satisfiability. These two components can be used independently to the benefit of other systems, or combined into a zk-SNARK that can prove/verify correctness of computations on this architecture.

Thus, pointer chasing in our case is cheaper than in [PGHR13] already for $N > 1000$, each cycle costs about 2000 gates, and can perform a random access to memory. Thus, pointer chasing in our case is cheaper than in [PGHR13] already for $N > 5000$, and the multiplicative saving, which is about $\frac{20N}{2000 \cdot (5+10)} = \frac{N}{4500}$, grows unbounded as $N$ increases.

Comparison with [BFRS+13]. The circuit generator of [BFRS+13] is also based on program analysis, but provides an additional feature that allows data-dependent memory accesses: a program may access memory by guessing the value and verifying its validity via a subcircuit that checks Merkle-tree authentication paths. In [BFRS+13], memory consists of $2^{30}$ cells, and each access costs many gates: 140K for a load, and 280K for a store. In comparison, in our circuit generator for vnTinyRAM (with word size $W = 32$ so that memory has $2^{22}$ cells), each memory store/load costs less than 1000 gates out of about 2000 per cycle (see Section 5.1). Besides the aforementioned feature, [BFRS+13] rely on program analysis, and (as in [SVPB+12,SBVB+13,PGHR13]) only support bounded control flow. Thus, [BFRS+13] performs better than our circuit generator for programs with bounded control flow and few data-dependent accesses to memory.

It is an intriguing open question whether techniques underlying our circuit generator can be combined with program analysis so to yield circuit generators achieving good efficiency both for restricted and rich programs, and avoid the sharp functionality vs. efficiency tradeoffs that exist among current circuit generators.

5.4.2 Comparison with prior zk-SNARKs

Addressing the other component of our system, the zk-SNARK for circuits: Figure 3 compares our implementation with prior ones, on a 1-million-gate circuit with a 1000-byte input. As shown, we mildly improve the key generation time and, more importantly, significantly improve the “online” costs of proving and verification.

6 Conclusion

We have presented two main contributions: (i) a circuit generator for a von Neumann RISC architecture that is universal and scales additively with program size; and (ii) a high-performance zk-SNARK for arithmetic circuit satisfiability. These two components can be used independently to the benefit of other systems, or combined into a zk-SNARK that can prove/verify correctness of computations on this architecture.

The benefits of universality. Universality attains the conceptual advance of once-and-for-all key generation, allowing verifying all programs up to a given size. This removes major issues in prior systems: expensive per-program key generation and the thorny issue of conducting it anew in a trusted way for every program.

The price of universality. We have demonstrated that our zk-SNARK scales, on a desktop computer, up to computations of 32,000 cycles, for programs with up to 10,000 instructions, relative to a simple universal computer (vnTinyRAM). Yet, the price of universality is still very high. Going forward, and aiming for widespread use in security applications, more work is required to slash costs of key generation and proving so to scale up to larger computations: e.g., billion-gate circuits, or millions of vnTinyRAM cycles, and beyond.
An interesting open problem is whether the “program analysis” techniques underlying most prior circuit
generators [SVPB+12, SBVB+13, PGHR13, BFRS+13], typically more efficient for restricted classes of
programs, can be used to construct universal circuits (for those same classes of programs).

Beyond vnTinyRAM. Finally, going beyond the foundation of a von Neumann RISC architecture, more work
lies ahead towards a richer architecture (e.g., efficient support for floating-point arithmetic and cryptographic
acceleration), code libraries, and tighter compilers.

Acknowledgments

We thank Daniel Genkin, Raluca Ada Popa, Ron Rivest, and Nickolai Zeldovich for helpful comments and
discussions, and Lior Greenblatt, Shaul Kfir, Michael Riabzev, and Gil Timnat for programming assistance.
We thank Ariel Gabizon and Bryan Parno for finding a problem in earlier versions of Lemma 2.4 and the
protocol in Appendix B. See [Gab19] for a security proof of the protocol in the generic group model.

This work was supported by: the Center for Science of Information (CSoI), an NSF Science and
Technology Center, under grant agreement CCF-0939370; the Check Point Institute for Information Security;
the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number
240258; the Israeli Centers of Research Excellence I-CORE program (center 4/11); the Israeli Ministry of
Science, Technology and Space; the Simons Foundation, with a Simons Award for Graduate Students in
Theoretical Computer Science; and the Skolkovo Foundation under grant agreement number 6926059.
A Other prior work

Prior work most relevant to us is about zk-SNARKs, and is discussed in Section 1.2. There are also numerous works studying variations or relaxations of the goal we consider; here, we summarize some of them.

Interactive proofs for low-depth circuits. Goldwasser et al. [GKR08] obtained an interactive proof for outsourcing computations of low-depth circuits. A set of works [CMT12, TRMP12, Tha13] has optimized and implemented the protocol of [GKR08]. The protocol of [GKR08] can also be reduced to a two-message argument system [KR09, KRR13]. Canetti et al. [CRR12] showed how to extend the techniques in [GKR08] to also handle non-uniform circuits.

Batching arguments. Ishai et al. [IKO07] constructed a batching argument for NP, where, to simultaneously verify that $N$ circuits of size $S$ are satisfiable, the verifier runs in time $\max\{S^2, N\}$.

A set of works [SBW11, SMBW12, SVPB+12, SBVB+13] has improved, optimized, and implemented the batching argument of Ishai et al. [IKO07] for the purpose of outsourcing computation. In particular, by relying on quadratic arithmetic programs of [GGPR13], Setty et al. [SBVB+13] have improved the running time of the verifier and prover to $\max\{S, N\} \cdot \text{poly}(\lambda)$ and $O(S) \cdot \text{poly}(\lambda)$ respectively.

Vu et al. [VSBW13] provide a system that incorporates both the batching arguments of [SBW11, SMBW12, SVPB+12, SBVB+13] as well as the interactive proofs of [CMT12, TRMP12, Tha13]. The system decides which of the two approaches is more efficient to use for outsourcing a given computation.

Braun et al. [BFRS+13] apply batching techniques (as well as zk-SNARKs) to verify MapReduce computations, by relying on various verifiable data structures.

Arguments with competing provers. Canetti et al. [CRR11] use collision-resistant hashes to get a protocol for outsourcing deterministic computations in a model where a verifier interacts with two computationally-bounded provers at least one of which is honest [FK97]. The protocol in [CRR11] works directly for random-access machines, and therefore does not require reducing random-access machines to any “lower-level” representation (such as circuits). Canetti et al. implement their protocol for deterministic x86 programs.

Previous circuit generators. Some prior work addresses the problem of translating high-level languages into low-level languages such as circuits. Most prior work only supports restricted classes of programs: [SVPB+12, SBVB+13] present a circuit generator based on Fairplay [MNPS04, BDNP08], whose SFDL language does not support important primitives and has inefficient support for others; [PGHR13] present a circuit generator for programs without data dependencies (pointers and array indices must be known at compile time, and so do loop iteration bounds).

Other works support more general functionality: [BCGTV13a] rely on nondeterministic routing to support random-access machine computations [BCGTV13a]; [BFRS+13] rely on online memory checking [BEGKN91, BCGT13a] to support accessing untrusted storage from a circuit.

See [BFRS+13, Section 2] for a more detailed overview of some of the above techniques.

Other cryptographic tools. Fully-homomorphic encryption (FHE) [Gen09] and probabilistically-checkable proofs [AS98, ALMSS98] are powerful tools that are often used in protocols for outsourcing computations (with integrity or confidentiality guarantees, or both) [Kil92, Mic00, AIK10, GGP10, CKV10, KRR13, GKPVZ13]. However, such constructions have so far not been explored in practice. Another powerful tool is secure multi-party computation [GMW87, BOGW88], but most work in this area does not consider the goal of succinctness.
B The PGHR zk-SNARK protocol

For the purposes of completeness and to fix notation, in Figure 10 below we recall the zk-SNARK protocol of Parno et al. [PGHR13]. The zk-SNARK can be used to prove/verify satisfiability of \( F_r \)-arithmetic circuits, where \( r \) is the order of the two cyclic groups \( G_1 \) and \( G_2 \), forming the domain of the pairing \( e : G_1 \times G_2 \to G_T \).

We refer the reader to [PGHR13] for further details regarding the intuition for the protocol, as well as the cryptographic assumptions on which its proof of security relies. (Briefly, security relies on the \( q \)-power Diffie–Hellman, \( q \)-power knowledge-of-exponent, and \( q \)-strong Diffie–Hellman assumptions [Gro10b, BB04, Gen04] for \( q \) that depends polynomially on the arithmetic circuit’s size.)

Figure 10: The zk-SNARK protocol of Parno et al. [PGHR13]. (More precisely, the protocol above differs from [PGHR13] in two ways. First, it does not assume that \( G_1 = G_2 \). Second, it has a verification key whose size grows as \( n + o(n) \), rather than \( 3n + o(n) \), by leveraging the non-degeneracy property in Lemma 2.4. The current version remedies a problem found by Gabizon [Gab19] and reported as CVE-2019-7167 in Zcash: \( pk'_A \) now starts at index \( n + 1 \), and \( pk'_A, pk'_B \) have been redefined accordingly.)
C Additional experimental data

For completeness, we report additional experimental data, beyond that reported in Section 5.

Additional data for our circuit generator. In Section 5.1 we discuss the performance of our circuit generator for \( \text{vnTinyRAM} \), and provided per-cycle gate counts in Figure 7. In Figure 11 we provide an extended version of Figure 7.

Additional data for our zk-SNARK for circuits. In Section 5.2 we discuss how we evaluated \((G, P, V)\), which is our zk-SNARK for arithmetic circuit satisfiability. In Figure 12 we report the unnormalized data from which Figure 8 is derived: we report the costs of the key generator \(G\) and prover \(P\) for various circuit sizes, and of the verifier \(V\) for various input sizes. We also provide information on various subcomputations, split between the information-theoretic ones having to do with QAPs, and the cryptographic ones having to do with exponentiations.

The reported key sizes assume that an element of \(G_1\) or \(G_2\) is compressed (i.e., a point \((x_0, y_0)\) lying on an elliptic curve \(y^2 = x^3 + Ax + B\) is encoded as \((x_0, b)\), where \(b\) is a bit distinguishing between the two square roots of \(x_0^3 + Ax_0 + B\)); to use a key, one typically first decompresses each element (and this is a one-time operation after transmission).

In the verifier, the reported running times assume that \(vk\) has been preprocessed (see Section 4.2), which is a one-time operation that can be amortized across any number of verifications.

Additional data for the combined system. In Section 5.3 we discuss how we evaluated \((\text{KeyGen}, \text{Prove}, \text{Verify})\), which is our zk-SNARK for \(\text{vnTinyRAM}\). In Figure 13 we report part of the unnormalized data from which Figure 9 is derived, and also provide the same data for word size \(W = 16\) as a comparison (since Figure 9 only reports \(W = 32\)). Concretely, we report the costs of KeyGen for various choices of program size bound \(\ell\) and time bound \(T\), while keeping the input size bound \(n\) fixed at 100; similarly for Prove. As for Verify, we report its running time for various choices of program size bound \(\ell\) and input size bound \(n\).

We also provide information on various subcomputations, specifically on how the running times are divided between the circuit generator and the zk-SNARK (the two components from which \((\text{KeyGen}, \text{Prove}, \text{Verify})\) is constructed). Namely, for KeyGen we report the subtime for running the circuit generator \(\text{circ}\) and the remaining time, which is spent in the zk-SNARK key generator \(G\). And for Prove we report the subtime for running the witness map \(\text{wit}\) and the remaining time, which is spent in the zk-SNARK prover \(P\). For Verify, we report the subtime to derive the program-specific verification key \(vk_P\) (see Section 5.3); this represents the time that is saved if one wishes to verify multiple statements \((P, \exists)\) for different inputs \(\exists\).
| $|C|/T$ | boot | exec. | mem. | routing | $|C|/T$ divided among | Per cycle | boot | exec. | mem. | routing |
|---|---|---|---|---|---|---|---|---|---|---|
| $T = 2^{16}$ | 1.337.8 | 1.41 | 777.0 | 429.1 | 130.3 | 1.968.6 | 2.44 | 1.114.0 | 721.9 | 130.3 |
| $T = 2^{17}$ | 1.340.4 | 0.70 | 777.0 | 425.5 | 137.2 | 1.968.4 | 1.22 | 1.114.0 | 716.0 | 137.2 |
| $T = 2^{18}$ | 1.345.8 | 0.35 | 777.0 | 423.8 | 144.6 | 1.972.2 | 0.61 | 1.114.0 | 713.0 | 144.6 |
| $T = 2^{19}$ | 1.352.4 | 0.18 | 777.0 | 422.9 | 152.3 | 1.978.1 | 0.30 | 1.114.0 | 711.5 | 152.3 |
| $T = 2^{20}$ | 1.359.7 | 0.09 | 777.0 | 422.4 | 160.2 | 1.985.1 | 0.15 | 1.114.0 | 710.7 | 160.2 |
| $T = 2^{21}$ | 1.367.4 | 0.04 | 777.0 | 422.2 | 168.1 | 1.992.5 | 0.08 | 1.114.0 | 710.4 | 168.1 |
| $T = 2^{22}$ | 1.375.2 | 0.02 | 777.0 | 422.1 | 176.0 | 2.000.3 | 0.04 | 1.114.0 | 710.2 | 176.0 |
| $T = 2^{23}$ | 1.383.1 | 0.01 | 777.0 | 422.0 | 184.0 | 2.008.1 | 0.02 | 1.114.0 | 710.1 | 184.0 |
| $T = 2^{24}$ | 1.391.0 | 0.00 | 777.0 | 422.0 | 192.0 | 2.016.1 | 0.01 | 1.114.0 | 710.0 | 192.0 |
| $T = 2^{25}$ | 1.399.0 | 0.00 | 777.0 | 422.0 | 200.0 | 2.024.0 | 0.00 | 1.114.0 | 710.0 | 200.0 |
| $T = 2^{26}$ | 1.407.0 | 0.00 | 777.0 | 422.0 | 208.0 | 2.032.0 | 0.00 | 1.114.0 | 710.0 | 208.0 |
| $T = 2^{27}$ | 1.415.0 | 0.00 | 777.0 | 422.0 | 216.0 | 2.040.0 | 0.00 | 1.114.0 | 710.0 | 216.0 |
| $T = 2^{28}$ | 1.423.0 | 0.00 | 777.0 | 422.0 | 224.0 | 2.048.0 | 0.00 | 1.114.0 | 710.0 | 224.0 |
| $T = 2^{29}$ | 1.431.0 | 0.00 | 777.0 | 422.0 | 232.0 | 2.056.0 | 0.00 | 1.114.0 | 710.0 | 232.0 |
| $T = 2^{30}$ | 1.439.0 | 0.00 | 777.0 | 422.0 | 240.0 | 2.064.0 | 0.00 | 1.114.0 | 710.0 | 240.0 |
| $T = 2^{31}$ | 1.447.0 | 0.00 | 777.0 | 422.0 | 248.0 | 2.072.0 | 0.00 | 1.114.0 | 710.0 | 248.0 |

Figure 11: Performance of our circuit generator for **TinyRAM** for various choices of $(\ell, n, T)$.}

27
Key generator $G$

| Gate count | Total time $|Q_A|\tau|$ | Subtime for computing $|v_K|$ | $\pi$ | $|v_K|$ | $|v_K|$ |
|-----------|-----------------|-----------------|----|-----------------|-----------------|
| 2^0       | 0.2 s           | 2.1 ms         | 0.2 s | 5.4 ms         | 204.8 KB | 2.8 KB |
| 2^1       | 0.4 s           | 4.1 ms         | 0.4 s | 5.4 ms         | 514.6 KB | 2.8 KB |
| 2^2       | 0.7 s           | 8.1 ms         | 0.6 s | 5.4 ms         | 1.0 MB   | 2.8 KB |
| 2^3       | 1.2 s           | 16.2 ms        | 1.2 s | 5.4 ms         | 2.1 MB   | 2.8 KB |
| 2^4       | 2.3 s           | 32.4 ms        | 2.3 s | 5.7 ms         | 4.2 MB   | 2.8 KB |
| 2^5       | 4.3 s           | 64.8 ms        | 4.2 s | 5.3 ms         | 8.3 MB   | 2.8 KB |
| 2^6       | 8.0 s           | 129.2 ms       | 7.8 s | 5.2 ms         | 16.6 MB  | 2.8 KB |

Prover $P$

| Gate count | Total time $|H(z)|\pi$ | $\pi$ | $|v_K|$ | $|v_K|$ |
|-----------|-----------------|-----------------|----|-----------------|-----------------|
| 2^0       | 189.0 ms        | 3.4 ms         | 182.5 ms | 230 B |
| 2^1       | 0.3 s           | 7.3 ms         | 0.3 s | 230 B |
| 2^2       | 0.6 s           | 15.5 ms        | 0.6 s | 230 B |
| 2^3       | 1.2 s           | 32.9 ms        | 1.2 s | 230 B |
| 2^4       | 2.2 s           | 69.6 ms        | 2.2 s | 230 B |
| 2^5       | 4.3 s           | 147.5 ms       | 4.2 s | 230 B |

Verifier $V$

| Input size | Total time | $|v_K|\pi$ | $|v_K|$ | $|v_K|$ |
|-----------|-----------------|-----------------|----|-----------------|-----------------|
| 2 B       | 475 B           | 4.8 ms         | 0.1 ms | 4.7 ms         |
| 4 B       | 475 B           | 4.8 ms         | 0.1 ms | 4.7 ms         |
| 8 B       | 475 B           | 4.8 ms         | 0.1 ms | 4.7 ms         |
| 16 B      | 475 B           | 4.8 ms         | 0.1 ms | 4.7 ms         |
| 32 B      | 498 B           | 4.9 ms         | 0.2 ms | 4.7 ms         |
| 64 B      | 521 B           | 4.9 ms         | 0.2 ms | 4.7 ms         |
| 128 B     | 590 B           | 4.9 ms         | 0.2 ms | 4.7 ms         |
| 256 B     | 728 B           | 5.0 ms         | 0.3 ms | 4.7 ms         |
| 512 B     | 1.0 KB          | 5.2 ms         | 0.5 ms | 4.7 ms         |
| 1.0 KB    | 1.5 KB          | 5.6 ms         | 0.9 ms | 4.7 ms         |
| 2.0 KB    | 2.6 KB          | 6.2 ms         | 1.5 ms | 4.7 ms         |
| 4.1 KB    | 4.7 KB          | 7.4 ms         | 2.6 ms | 4.7 ms         |
| 8.2 KB    | 8.8 KB          | 9.5 ms         | 4.8 ms | 4.7 ms         |
| 16.4 KB   | 17.2 KB         | 13.4 ms        | 8.7 ms | 4.7 ms         |
| 32.8 KB   | 34.0 KB         | 21.7 ms        | 16.8 ms | 4.9 ms |
| 64.5 KB   | 67.5 KB         | 34.6 ms        | 29.8 ms | 4.7 ms |
| 131.1 KB  | 134.4 KB        | 61.2 ms        | 56.5 ms | 4.7 ms |
| 262.1 KB  | 268.4 KB        | 112.7 ms       | 107.8 ms | 4.9 ms |
| 524.5 KB  | 536.8 KB        | 207.7 ms       | 203.0 ms | 4.8 ms |
| 1.0 MB    | 1.1 MB          | 395.1 ms       | 390.3 ms | 4.8 ms |

Figure 12: Performance of our zk-SNARK for arithmetic circuit satisfiability, for the two security levels we considered in this paper. ($N = 10$ and std $< 1\%$)
<table>
<thead>
<tr>
<th>KeyGen</th>
<th>( W = 16 )</th>
<th>( 128 \text{-bit security, } K = 16, n = 100 )</th>
<th>( W = 32 )</th>
<th>( \ell = 2K )</th>
<th>( \ell = 4K )</th>
<th>( \ell = 2K )</th>
<th>( \ell = 4K )</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( T = 4K )</td>
<td>( 538.1 \text{s} )</td>
<td>( 4.5 \text{s} )</td>
<td>( 533.6 \text{s} )</td>
<td>( 606.4 \text{s} )</td>
<td>( 4.9 \text{s} )</td>
<td>( 601.5 \text{s} )</td>
</tr>
<tr>
<td></td>
<td>( T = 8K )</td>
<td>( 996.9 \text{s} )</td>
<td>( 8.4 \text{s} )</td>
<td>( 988.4 \text{s} )</td>
<td>( 1040.9 \text{s} )</td>
<td>( 8.9 \text{s} )</td>
<td>( 1032.1 \text{s} )</td>
</tr>
<tr>
<td></td>
<td>( T = 16K )</td>
<td>( 1927.7 \text{s} )</td>
<td>( 16.2 \text{s} )</td>
<td>( 1911.5 \text{s} )</td>
<td>( 1980.1 \text{s} )</td>
<td>( 16.8 \text{s} )</td>
<td>( 1963.2 \text{s} )</td>
</tr>
<tr>
<td></td>
<td>( T = 32K )</td>
<td>( 3896.0 \text{s} )</td>
<td>( 32.0 \text{s} )</td>
<td>( 3864.0 \text{s} )</td>
<td>( 3956.9 \text{s} )</td>
<td>( 32.5 \text{s} )</td>
<td>( 3924.4 \text{s} )</td>
</tr>
<tr>
<td>Prove</td>
<td>( W = 16 )</td>
<td>( \ell = 2K )</td>
<td>( \ell = 4K )</td>
<td>( W = 32 )</td>
<td>( \ell = 2K )</td>
<td>( \ell = 4K )</td>
<td></td>
</tr>
<tr>
<td></td>
<td>( T = 4K )</td>
<td>( 1.5 \text{GB} )</td>
<td>( 8.7 \text{KB} )</td>
<td>( 1.7 \text{GB} )</td>
<td>( 16.8 \text{KB} )</td>
<td>( 2.3 \text{GB} )</td>
<td>( 17.0 \text{KB} )</td>
</tr>
<tr>
<td></td>
<td>( T = 8K )</td>
<td>( 2.8 \text{GB} )</td>
<td>( 8.7 \text{KB} )</td>
<td>( 3.0 \text{GB} )</td>
<td>( 16.8 \text{KB} )</td>
<td>( 4.4 \text{GB} )</td>
<td>( 17.0 \text{KB} )</td>
</tr>
<tr>
<td></td>
<td>( T = 16K )</td>
<td>( 5.5 \text{GB} )</td>
<td>( 8.7 \text{KB} )</td>
<td>( 5.7 \text{GB} )</td>
<td>( 16.8 \text{KB} )</td>
<td>( 8.6 \text{GB} )</td>
<td>( 17.0 \text{KB} )</td>
</tr>
<tr>
<td></td>
<td>( T = 32K )</td>
<td>( 10.9 \text{GB} )</td>
<td>( 8.7 \text{KB} )</td>
<td>( 11.1 \text{GB} )</td>
<td>( 16.8 \text{KB} )</td>
<td>( 17.1 \text{GB} )</td>
<td>( 17.0 \text{KB} )</td>
</tr>
<tr>
<td>Verify</td>
<td>( W = 16 )</td>
<td>( \ell = 2K )</td>
<td>( \ell = 4K )</td>
<td>( W = 32 )</td>
<td>( \ell = 2K )</td>
<td>( \ell = 4K )</td>
<td></td>
</tr>
<tr>
<td></td>
<td>( n = 0 )</td>
<td>( 13.1 \text{ms} )</td>
<td>( 7.5 \text{ms} )</td>
<td>( 5.6 \text{ms} )</td>
<td>( 19.1 \text{ms} )</td>
<td>( 13.5 \text{ms} )</td>
<td>( 5.6 \text{ms} )</td>
</tr>
<tr>
<td></td>
<td>( n = 10 )</td>
<td>( 13.2 \text{ms} )</td>
<td>( 7.5 \text{ms} )</td>
<td>( 5.7 \text{ms} )</td>
<td>( 19.2 \text{ms} )</td>
<td>( 13.5 \text{ms} )</td>
<td>( 5.7 \text{ms} )</td>
</tr>
<tr>
<td></td>
<td>( n = 10^2 )</td>
<td>( 13.5 \text{ms} )</td>
<td>( 7.5 \text{ms} )</td>
<td>( 6.0 \text{ms} )</td>
<td>( 19.6 \text{ms} )</td>
<td>( 13.5 \text{ms} )</td>
<td>( 6.0 \text{ms} )</td>
</tr>
<tr>
<td></td>
<td>( n = 10^3 )</td>
<td>( 15.4 \text{ms} )</td>
<td>( 7.5 \text{ms} )</td>
<td>( 7.9 \text{ms} )</td>
<td>( 21.5 \text{ms} )</td>
<td>( 13.6 \text{ms} )</td>
<td>( 7.9 \text{ms} )</td>
</tr>
<tr>
<td></td>
<td>( n = 10^4 )</td>
<td>( 29.4 \text{ms} )</td>
<td>( 7.5 \text{ms} )</td>
<td>( 21.9 \text{ms} )</td>
<td>( 35.3 \text{ms} )</td>
<td>( 13.5 \text{ms} )</td>
<td>( 21.8 \text{ms} )</td>
</tr>
</tbody>
</table>

Figure 13: Performance of our zk-SNARK for vrnTinyRAM at the 128-bit security level, for word sizes \( W = 16 \) and \( W = 32 \). (\( N = 10 \) and \( \text{std} < 1.5\% \) for all, except that \( \text{std} < 5\% \) whenever \( T = 32K \))
D zk-SNARKs for vnTinyRAM

For completeness, we explain how a circuit generator for vnTinyRAM (see Section 3) can be combined with a zk-SNARK for arithmetic circuit satisfiability to obtain a zk-SNARK for vnTinyRAM. We first informally define zk-SNARKs for vnTinyRAM (Appendix D.1) and then we give the construction (Appendix D.2).

D.1 Informal definition

A zk-SNARK for vnTinyRAM is a cryptographic primitive that gives short and easy-to-verify non-interactive zero-knowledge proofs of knowledge for the correct execution of programs on the machine vnTinyRAM. Below, we only provide an informal definition; for details, we refer the reader to [BCIOP13], where a formal definition for any random-access machine can be found. Below, the security parameter is implicit.

A zk-SNARK for vnTinyRAM is a triple of polynomial-time algorithms (KeyGen, Prove, Verify) working as follows.

- **KeyGen** \((\ell, n, T) \rightarrow (pk, vk)\). On input a program size bound \(\ell\), time bound \(T\), and primary-input size bound \(n\), the key generator KeyGen probabilistically samples a proving key \(pk\) and a verification key \(vk\).

The keys \(pk\) and \(vk\) are published as public parameters and can be used, any number of times, to prove and verify membership in the language \(L_{\ell,n,T}\) as follows.

- **Prove** \((pk, P, x, w) \rightarrow \pi\). On input a program \(P\) with \(\leq \ell\) instructions, primary input \(x\) with \(\leq n\) words, and auxiliary input \(w\) such that \(P(x, w)\) accepts in \(\leq T\) steps, the prover Prove outputs a non-interactive proof \(\pi\) for the statement \((P, x) \in L_{\ell,n,T}\).

- **Verify** \((vk, P, x, \pi) \rightarrow b\). On input a program \(P\) with \(\leq \ell\) instructions, primary input \(x\) with \(\leq n\) words, and proof \(\pi\), the verifier Verify outputs \(b = 1\) if he is convinced that \((P, x) \in L_{\ell,n,T}\).

The key generator KeyGen is universal: it does not depend on the program \(P\) or primary input \(x\), but only on their respective size bounds \(\ell\) and \(n\) (as well as the time bound \(T\)).

A zk-SNARK satisfies the following properties.

**Completeness.** The honest prover can convince the verifier for any instance in the language. I.e., for every \((P, x) \in L_{\ell,n,T}\) with a witness \(w\),

\[
\Pr \left[ \text{Verify}(vk, P, x, \pi) = 1 \bigg| (pk, vk) \leftarrow \text{KeyGen}(\ell, n, T), \pi \leftarrow \text{Prove}(pk, P, x, w) \right] = 1 .
\]

**Succinctness.** An honestly-generated proof \(\pi\) has \(O(1)\) bits, and Verify\((vk, P, x, \pi)\) runs in time \(O(\ell + n)\). In particular, verification time does not depend on the time bound \(T\).

**Proof of knowledge (and soundness).** If the verifier accepts a proof, the prover “knows” a witness for the instance. (Thus, soundness holds.) I.e., for every polynomial-size adversary \(A\) there is a polynomial-size witness extractor \(E\) s.t.

\[
\Pr \left[ \text{Verify}(vk, P, x, \pi) = 1 \bigg| (P, x) \notin R_{\ell,n,T} \right] \Pr \left[ (pk, vk) \leftarrow \text{KeyGen}(\ell, n, T), (P, x, \pi) \leftarrow A(pk, vk), w \leftarrow E(pk, vk) \right] \leq \text{negl} .
\]

**Zero knowledge.** The proof \(\pi\) is statistical zero knowledge.
KeyGen
- **INPUTS**: bounds $\ell, n, T$
- **OUTPUTS**: proving key $pk$ and verification key $vk$
  1. Compute $C := \text{circ}(\ell, n, T)$.
  2. Compute $(pk, vk) := G(C)$, and output $(pk, vk)$.

Prove
- **INPUTS**: proving key $pk$ and $(P, x) \in L_{\ell, n, T}$ with witness $w$
- **OUTPUTS**: proof $\pi$
  1. Compute $\vec{x} := \left[ P \right]^{r_{2W}} \circ \left[ x \right]^{n_{W}}$.
  2. Compute $\vec{a} := \text{wit}(\ell, n, T, P, x, w)$.
  3. Compute $\pi := P(pk, \vec{x}, \vec{a})$, and output $\pi$.

Verify
- **INPUTS**: verification key $vk$ and $(P, x) \in L_{\ell, n, T}$
- **OUTPUTS**: decision bit
  1. Compute $\vec{x} := \left[ P \right]^{r_{2W}} \circ \left[ x \right]^{n_{W}}$.
  2. Compute $b := V(vk, \vec{x}, \pi)$, and output $b$.

Figure 14: A zk-SNARK for vnTinyRAM is obtained by combining the circuit generator and the zk-SNARK for circuit satisfiability.

### D.2 Construction

Let $(G, P, V)$ be a zk-SNARK for $\mathbb{F}_r$-arithmetic circuit satisfiability, and let $(\text{circ}, \text{wit})$ be a circuit generator for vnTinyRAM over $\mathbb{F}_r$. (The prime $r$ is typically determined by the order of the two cyclic groups $G_1$ and $G_2$ that form the domain of the pairing $e : G_1 \times G_2 \to G_T$ used to instantiate $(G, P, V)$.) In Figure 14 below, we give the construction of the three algorithms (KeyGen, Prove, Verify) of a zk-SNARK for vnTinyRAM.
E Case study: `memcpy` with just-in-time compilation

The function `memcpy` is a standard C function that works as follows: given as input two array pointers and a length, `memcpy` copies the contents of one array to the other. Of course, with no data dependencies, copying data in a circuit is trivial: you just connect the appropriate wires. However, when the array addresses and their lengths are unknown, and `memcpy` is invoked as a subroutine in a larger program, the trivial solution does not work, and an efficient implementation is needed.

A naive implementation of `memcpy` iterates, via a loop, over each array position $i$ and copies the $i$-th value from one array to the other. In `vnTinyRAM` each such loop iteration costs 6 instructions; 2 of these are to increase the iteration counter and jump back to the start of the loop. Thus, for $m$-long arrays, copying takes $6m$ instructions (discounting loop initialization). A cost of $6m$ also holds for TinyRAM of [BCGTV13a].

But, in `vnTinyRAM`, one can do better: loop unrolling can be used to avoid paying for the 2 “control” instructions. Asymptotically, the optimal number of unrollings depends on the array length: it is $\Theta(\sqrt{m})$. Thus, optimal unrolling requires dynamic code generation on a von Neumann architecture. We wrote a 54-instruction `vnTinyRAM` program for `memcpy` that uses dynamic loop unrolling to achieve an efficiency of $\approx 4m + 11.5\sqrt{m}$ cycles for $m$-long arrays. For $m \geq 600$, we get $1.25\times$ speed-up over the naive implementation, and $1.4\times$ speed-up for $m \geq 3000$. 
References


