Folding-based zkLLM

Wilbert Wu

Abstract
This paper introduces a new approach to construct zero-knowledge large language models (zkLLM) based on the Folding technique. We first review the concept of Incrementally Verifiable Computation (IVC) and compare the IVC constructions based on SNARK and Folding. Then we discuss the necessity of Non-uniform IVC (NIVC) and present several Folding schemes that support more expressive circuits, such as SuperNova, Sangria, Origami, HyperNova, and Protostar. Based on these techniques, we propose a zkLLM design that uses a RAM machine architecture with a set of opcodes. We define corresponding constraint circuits for each opcode and describe the workflows of the prover and verifier. Finally, we provide examples of opcodes to demonstrate the circuit construction methods. Our zkLLM design achieves high efficiency and expressiveness, showing great potential for practical applications.

1 IVC

IVC (Incrementally Verifiable Computation) is a new primitive in Folding ZK. Based on Folding, IVC can be constructed more efficiently, where the prover proves the correctness of incremental computation \( y = F^n(x) \).

1.1 IVC from SNARK
Previously, IVC was constructed based on SNARK, and the recursive circuit included computation logic and SNARK verification logic.

Since SNARK verification logic needs to be expressed in the circuit, and SNARK verification logic often involves some circuit-unfriendly operations, such as pairing and non-native operations, the SNARK-based IVC construction has a large recursive overhead.

1.2 IVC from Folding
The IVC construction based on Folding replaces the SNARK verifier circuit with the Folding verifier circuit, greatly reducing the recursive overhead. In Nova, the prover only needs to perform 2 MSM operations of \( O(C) \) scale and a small amount of hashing for each step, without requiring FFT. Practice shows that Nova’s recursive overhead is around 10,000 constraints.
2 NIVC

IVC requires the computation logic of each iteration to be the same, which has a gap with practical applications. For example, a CPU executes arbitrary instructions from the instruction set in each iteration. This requires the use of NIVC. NIVC (Non-uniform IVC) allows the computation function executed in each iteration to vary within the function set.

To construct NIVC, a naive method is to ”use a universal circuit to express multiple step functions”, which is the commonly used Selector scheme nowadays: laying out all instruction circuits and then activating one of the step instruction circuits through a Selector. This approach has a major drawback: the scale of the universal circuit will expand to the sum of all instruction circuits. In a VM or LLM, an instruction circuit represents a supported instruction. In this case, if the supported instruction set contains a large number of instructions, the final universal circuit will have a large expansion. This path is not ideal, so we need to find an alternative: is there a Folding scheme that supports multiple step functions??

2.1 SuperNova

The answer is SuperNova[2], an extension of Nova[1]. Its outstanding feature is that the cost of each step of the prover is only proportional to the size of the instruction circuit invoked by the program, which is extremely advantageous when the instruction set is complex.
2.2 Folding More

SuperNova supports only R1CS. In practical applications, circuits with richer expressiveness may be needed, such as custom gates and lookups. There are many existing schemes exploring how to fold circuits with richer expressiveness, such as Sangria\cite{4}, Origami\cite{3}, HyperNova\cite{6}, and Protostar\cite{5}.

3 zkLLM

3.1 Design

We construct a RAM machine that supports an instruction set \( \mathcal{I} = \mathcal{I}_\ell \) of size \( \ell \). This RAM machine has 3 control registers \( ts, pc, flag \) for flow control; 1 general-purpose register \( [gpr] \) for recording the commitment of a general-purpose register of length \( k \); 1 stack pointer register and 1 stack register \( sp, [sm] \) for recording the commitment of a stack pointer and stack memory of length \( m \), respectively. \( in \) and \( out \) are used to record the input and output, and \( [p] \) is used to record the input program.

\[
\begin{array}{cccccccc}
| ts | pc | flag | [gpr] | sp | [sm] | in | out | [p] |
\end{array}
\]

Table 1: Registers for flow control, numerical calculations, and input and output

The design distinguishes between general-purpose registers and stack memory, referring to the design of registers and memory in typical computer architectures. When operating only on registers, the overhead is smaller because the length of registers is limited and the cost of opening their commitments is small.
For each opcode, a corresponding augmented circuit is defined:

\[ F_j^{'}(v_k, U_i, u_i, (ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_0, out_i, [gpr]_i, [p]_0, T) \rightarrow x: \]

1. Update \( ts_{i+1} \leftarrow ts_i + 1 \)

2. Calculate the instruction \( [ip]_{i+1} \in Z_{t+1}^* \leftarrow \phi(pc_i, [p]_0) \) to be executed in the current step based on the \( pc_i \) input from the previous step

3. If \( ts_i == 0 \) (for INIT):
   
   (a) Initialize the running instance list \( U_{i+1} \leftarrow u^\ell \)

   (b) Check that \( in_0 \) is correctly uploaded to \( gpr \), which is an index-lookup check, as shown in Figure 3

   (c) Verify \( [sm]_i = [0^m], pc = 0, flag = 0 \) and \( sp = 0 \) to ensure correct initialization

4. Otherwise:
   
   (a) Verify \( u_i.x = \text{hash}(v_k, U_i, ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_0, out_i, [gpr]_i, [p]_0) \) to ensure the output of the previous step is the input of the current step

   (b) Verify \( j = ip_{i+1} \) to ensure the correct instruction circuit is constrained

   (c) Verify \( (u_i, E, u_i, u) = (0, 1) \) to ensure the augmented circuit strictly holds

   (d) Update the running instance list \( U_{i+1}[ip_i] \leftarrow \text{NIFS}(v_k[ip_i], U_i[ip_i], u_i, T) \), folding the augmented circuit

5. Update the register state according to the opcode

\[ (pc_{i+1}, flag_{i+1}, [gpr]_{i+1}, [sm]_{i+1}) \leftarrow F_j(pc_i, flag_i, [gpr]_i, [sm]_i) \]

6. Output

\[ x \leftarrow \text{hash}(v_k, U_{i+1}, ts_{i+1}, pc_{i+1}, ip_{i+1}, flag_{i+1}, sp_{i+1}, [sm]_{i+1}, in_0, out_{i+1}, [gpr]_{i+1}, [p]_0) \]

Notes:

1. \([p], [gpr], [sm]\) represent the commitment of the input vector (general-purpose register), input code, and stack memory

2. \( p \) represents the input code, which is a table consisting of \((pc, ip, operand)\), as shown in Figure 4

3. The \( \phi(\cdot) \) in step 2 is a decoder that takes the program \( p \) and program counter \( pc \) as input and calculates the instruction and operands to be executed in the current step, which is essentially an index-lookup, as shown in Figure 5

4. The opcode updates some registers and outputs, such as MUL and ADD modifying the \( pc \), ip, and \([gpr]\) registers; JMP and JMPC modifying the \( pc \) and ip registers; and S_MUL modifying the \( sp \) and \([sm]\) registers
The prover updates the proof $\Pi$ using the trace at each step, $P(pk, \Pi_i) \rightarrow \Pi_{i+1}$:

1. Parse the proof $\Pi_i$ of step $i$ as
   $((U_i, W_i), (u_i, w_i), (ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_{0, out}, [gpr]_i, [p]_0))$

2. If $ts_i == 0$:
   (a) Initialize $((U_{i+1}, W_{i+1}), T) \leftarrow (u_{\perp}^i, w_{\perp}^i, u_{\perp}E)$
   (b) Initialize $[sm]_i = [0^m]$, $pc = 0$, $flag = 0$, $sp = 0$, $[gpr]$

3. Otherwise:
   (a) Update the corresponding running instance according to the instruction pointer
   $((U_{i+1}[ip_i], W_{i+1}[ip_i]), T) \leftarrow NIFS.P(pk[ip_i], (U_i[ip_i], W_i[ip_i]), (u_i, w_i))$
   (b) Calculate the instruction $[p]_{i+1} \in Z_{i+1}^* \leftarrow \varphi(pc_i, [p]_0)$ to be executed in the current step
Figure 5: lookup the triple in the program table

(c) Calculate the trace of the current step’s augmented circuit
\[(u_{i+1}, w_{i+1}) \leftarrow \text{trace}(F'_{ip,i+1}, vk, U_i, u_i, ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_0, out_i, [gpr]_i, [p]_0, \tilde{T})\]

(d) Update the proof \(\Pi_{i+1} \leftarrow ((u_i, w_i), ip_{i+1})\)

The verifier obtains the proof \(\Pi_i\) from the last folding \(\mathcal{V}(vk, \Pi_i) \rightarrow \{0, 1\}\):

1. If \(ts_i == 0\):
   (a) Verify that \(in_0\) is correctly uploaded to \(gpr\)
   (b) Verify \([sm]_i = [0^m], pc = 0\), flag = 0, sp = 0 to ensure correct initial values

2. Otherwise:
   (a) Parse the proof \(\Pi_i\) as
      \(((U_i, W_i), (u_i, w_i), (ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_0, out_i, [gpr]_i, [p]_0))\)
   (b) Verify \(u_i.x = \text{hash}(vk, U_i, ts_i, pc_i, ip_i, flag_i, sp_i, [sm]_i, in_0, out_i, [gpr]_i, [p]_0)\)
   (c) \(\varphi(pc_i, [p]_0) = \text{endvar}\)
   (d) Verify \((u_i, \tilde{E}, u_i, u) = (0, 1)\)
   (e) For \(F'_{ip}\), verify the \((u_i, w_i)\) of the last iteration
   (f) For all \(F'\), verify all \((U_i[j], W_j)_{j \in [\ell]}\)

3.2 Opcode Examples

3.2.1 INIT

INIT is used to upload the initial input \(in_0\) agreed upon by both parties (including the consensus of private input) to the predetermined addresses of \(gpr\).

The INIT circuit needs to constrain:
1. \( t_s = 0, [sm]_i = [0^n], pc = 0, \) flag = 0, sp = 0

2. The values at the predetermined positions of \( gpr \) are equal to the input, 
   \( in_0 = \text{OPEN}(gpr_i, addr) \), where \( \text{OPEN} \) can be implemented as the index
   lookup constraint shown in Figure 3

3. Update \( pc_{i+1} = pc_i + 1 \), and the rest of the register states remain un-
   changed

   Note: The maximum length of the initial input is specified as \( d \), and less
   than that is padded with 0. For example, if the maximum initial length is \( d = 4 \),
   and the initial input is \( in_0 = [1, 2, 0, 0] \), then 2 zeros are padded. The initial
   register state satisfying the constraints should be:

   ![Figure 6: INIT legal state](image)

3.2.2 ADD1.4

ADD1.4 \( addr_0 \) \( addr_1 \) \( addr_2 \) is used for the addition of two 1*4 tensors at specified
addresses, and the result is stored at the specified address. The update requires
constraints:

1. The computation at the specified addresses of \( gpr \) is correct: left ←
   \( \text{OPEN}([gpr]_i, addr_0) \), right ← \( \text{OPEN}([gpr]_i, addr_1) \), output ← \( \text{OPEN}([gpr]_{i+1}, addr_2) \),
   output = left + right

2. Update \( pc_{i+1} = pc_i + 1, [gpr]_{i+1} = \text{UPDATE}([gpr]_i, addr_2) \)

   Note: ← represents generating an intermediate variable and constraining it.
   The UPDATE constraint is \( [gpr]_{i+1} = [gpr]_i + \sum addr_2 ([gpr]_{i+1} - [gpr]_i) \cdot [L_{addr_2}(X)] \), and its complexity is related to the number of modified addresses.

   For example, ADD1.4 0 4 8 adds \([1, 2, 0, 0]\) at address 0 and \([1, 2, 3, 4]\) at
   address 4, and places the result at address 8.
3.2.3 LE1\_4

LE1\_4 \text{addr}_0 \text{ addr}_1 is used to compare two 1*4 tensors at specified addresses, and the flag is updated based on the comparison result. The circuit needs to constrain:

1. If \( a \geq b \): \( \text{flag}_{i+1} = 1 \)
2. Update \( \text{pc}_{i+1} = \text{pc}_i + 1 \)

For example, LE1\_4 0 4 compares \([1, 2, 0, 0]\) at address 0 with \([1, 2, 3, 4]\) at address 4.

---

Figure 7: ADD1\_4 legal state (ignoring the unchanged stack memory and output)
3.2.4 JMPC

JMPC addr_0 addr_1 is used to jump based on flag. The circuit needs to constrain:

1. If flag_{i+1} holds, update pc_{i+1} = addr_0, otherwise update pc_{i+1} = addr_1

For example, JMPC 1 4 sets pc to 1.

![State transition diagram]

Figure 9: JMPC legal state (ignoring the unchanged registers, stack memory, and output)

3.2.5 RETURN

RETURN addr is used to download the data at the specified address in addr to out. The circuit needs to constrain:

1. Update out_{i+1} = OPEN([gpr]_i, addr)
2. Update pc_{i+1} = pc_{end}

The maximum length d of the output out is specified. For example, RETURN 8 outputs [2, 4, 3, 4] at address 8 to out.

3.2.6 END

END is used to indicate the end of the program execution. The circuit needs to constrain:

1. All states remain unchanged
Figure 10: RETURN legal state (ignoring the unchanged stack memory)

References


