A Privacy-Preserving Reinforcement Learning Approach for Dynamic Treatment Regimes on Health Data



Introduction
As a recent trend in healthcare, personalized medicine [1] enables the patient to obtain early diagnoses, risk estimation, and optimal treatments at low cost by using molecular and cellular analysis technologies, diagnosis results, genetic information, and so on. Personalized medicine is usually implemented through dynamic treatment regimes [2,3], which provide different therapeutic methods according to the time-varying clinical states of the patient. This technology is particularly suitable for coping with complex chronic illnesses that progress through multiple stages, such as diabetes, mental diseases, alcohol dependence, and human immunodeficiency virus infection.
Reinforcement learning [4], which learns through trial and error while interacting with a dynamic environment, is an important method for developing dynamic treatment regimes, industrial automation, vehicular networks [5,6], and other applications [7][8][9][10][11][12]. Meanwhile, with the developing technologies of the Internet of Things and cloud computing, dynamic treatment regimes based on reinforcement learning are becoming increasingly attractive. For example, wearable devices can monitor the patient's health data, including heart rate and blood sugar levels. The collected health data are then stored on the cloud, where a reinforcement learning algorithm can be run on them to make treatment decisions.
Unfortunately, because of the patient's limited computation ability, health data are usually outsourced to the cloud server for implementing the reinforcement learning algorithm. Because the cloud server may be untrusted, health data may be illegally accessed, forged, tampered with, or discarded during transmission and computation, which can harm personal privacy, economic interests, and even the safety of human life. For example, the American Medical Collection Agency, a billing service company, was breached in 2019 [13]. The attack affected the health data of about 12 million patients, and the company's parent firm subsequently filed for bankruptcy. Furthermore, unlike financial data or other types of human-generated data [14], health data are permanent biological data. They cannot be modified or wiped to limit the damage caused by their disclosure.
In order to protect health data, we can encrypt them with a traditional encryption algorithm. Unfortunately, the reinforcement learning algorithm cannot then be executed on the encrypted health data easily and flexibly. Homomorphic encryption [15] supports operations on ciphertexts. Hence, by using homomorphic encryption, the cloud server can run the reinforcement learning algorithm directly on the encrypted health data without leaking patient privacy. Finally, the encrypted computation result is returned to the patient, who recovers it by using his own secret key.
In this paper, we endeavor to study the security of health data in the above realistic scenario and focus on the secure implementation of the asynchronous advantage actor-critic (A3C) reinforcement learning algorithm. Taking into account the privacy and computation of health data on untrusted cloud servers, we adopt homomorphic encryption as the main cryptographic primitive. We make the following three contributions: (1) Because Cheon et al.'s approximate homomorphic encryption scheme [16] is more efficient than fully homomorphic encryption (FHE), we use it to design secure computation protocols, namely, a homomorphic comparison protocol, a homomorphic maximum protocol, a homomorphic exponential protocol, and a homomorphic division protocol. Based on these protocols, we also design the first homomorphic reciprocal of square root protocol that needs only one approximate computation. (2) Based on the proposed secure computation protocols, we design the secure A3C reinforcement learning algorithm for the first time and use it to implement a secure treatment decision-making algorithm. (3) Finally, we simulate the proposed secure computation protocols and algorithms on a personal computer's virtual machine and demonstrate their efficiency through a thorough analysis.
The layout of this paper is as follows. Section 2 analyzes related work on homomorphic encryption and secure computation over encrypted health data. Preliminaries are presented in Section 3. Section 4 describes our secure dynamic treatment regime model on health data. Building blocks are discussed in Section 5. Section 6 describes the proposed privacy-preserving A3C reinforcement learning algorithm and treatment decision-making algorithm. Performance results are shown and analyzed in Section 7. Finally, this paper is concluded in Section 8.

Related Work
In this section, we introduce related work about reinforcement learning, homomorphic encryption, and the computation of encrypted health data, which are described as follows.
Reinforcement learning algorithms can be mainly classified as value-based algorithms, policy-based algorithms, and actor-critic algorithms. Value-based algorithms usually compute the optimal cumulative reward and derive a suggested policy from it. As a typical value-based algorithm, Q-learning estimates the utility of each individual state-action pair; it has been applied to path planning [17,18] in vehicular networks. Policy-based algorithms estimate the optimal policy directly; Williams [19] proposed the policy-based algorithm REINFORCE. Actor-critic algorithms combine the advantages of value-based and policy-based algorithms. The A3C reinforcement learning algorithm [20] is an actor-critic algorithm that can work in discrete action spaces as well as continuous action spaces [21].
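To make the value-based family concrete, the one-step tabular Q-learning update can be sketched as follows (a toy example of ours, not taken from the cited schemes; the two-state table, the action names, and the values of `alpha` and `gamma` are illustrative):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s][a] toward the target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Toy 2-state table with two candidate treatments (illustrative values only).
Q = {0: {"treat": 0.0, "wait": 0.0}, 1: {"treat": 0.0, "wait": 0.0}}
q_update(Q, s=0, a="treat", r=1.0, s_next=1)  # Q[0]["treat"] becomes 0.1
```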
The concept of homomorphic encryption begins from privacy homomorphism [15]. According to the types of supported homomorphic operations, homomorphic encryption can be divided into partial homomorphic encryption (PHE), somewhat homomorphic encryption (SWHE), and FHE. PHE only supports homomorphic addition or homomorphic multiplication. SWHE is the basis of FHE. SWHE supports finite homomorphic addition and homomorphic multiplication. FHE supports arbitrary homomorphic addition and homomorphic multiplication.
In 2009, Gentry [22] designed the first FHE scheme, which is based on ideal lattices. Since then, homomorphic encryption has become a research hotspot. In order to improve the efficiency of homomorphic operations, Gentry et al. [23] constructed an FHE scheme based on the approximate eigenvector method, in which the ciphertext noise increases only linearly after each homomorphic multiplication. Although homomorphic multiplication in this scheme is efficient, it does not support the single instruction multiple data (SIMD) technique [24]. Then, based on the learning with errors over rings (RLWE) assumption [25] and the relinearization technique [24], Brakerski et al. [24] designed an FHE scheme that supports the SIMD technique. However, this scheme does not support approximate homomorphic operations. Hence, based on Brakerski et al.'s scheme [24], Cheon et al. [16] proposed an improved homomorphic encryption scheme.
In terms of the computation of encrypted health data using homomorphic encryption, several schemes exist. Khedr and Gulak [26] first proposed an optimized homomorphic encryption scheme based on Gentry et al.'s scheme [23] and applied it to secure medical computations, including comparison, the Pearson goodness-of-fit test, and logistic regression. Sun et al. [27] implemented secure average heart rate computation, long QT syndrome detection, and chi-square tests by using Dowlin et al.'s FHE scheme [28]. Based on Boneh et al.'s homomorphic encryption scheme [29], Poon et al. [30] implemented a secure Fisher's exact test algorithm, which is often used to guarantee the statistical stability of genetic analysis. Raisaro et al. [31] used homomorphic encryption to explore genomic cohorts securely in a real scenario. In 2019, based on the distributed two-trapdoor public-key algorithm [32] and the Q-learning algorithm, Liu et al. [2] constructed a secure reinforcement learning model for making treatment decisions dynamically. Based on Fan's SWHE scheme [33], Jiang et al. [34] performed secure and efficient feature point detection and image matching for retinal images of diabetic retinopathy.
However, most of the above schemes are based on PHE, which supports homomorphic addition or homomorphic multiplication, but not both. If the missing operation is required, excessive rounds of interaction are needed. FHE avoids this problem but suffers from inefficient homomorphic operations. Furthermore, the above schemes usually adopt Q-learning as the reinforcement learning algorithm; no existing approach implements the A3C reinforcement learning algorithm securely.

Preliminaries
In this section, we begin with basic notations and the definition of Cheon et al.'s approximate homomorphic encryption scheme. Then, we introduce the A3C algorithm.
Let R = ℤ[x]/⟨Φ_m(x)⟩ denote the quotient ring modulo Φ_m(x), where λ is the security parameter, m is a positive integer, and Φ_m(x) is the mth cyclotomic polynomial. R_q = ℤ_q[x]/⟨Φ_m(x)⟩ represents the residue ring modulo q and Φ_m(x), where q ≥ 2 is a prime modulus.
For an integer h > 0, the distribution HWT(h) selects a vector with entries in {0, ±1} uniformly at random subject to having Hamming weight h. For a rational number σ > 0, the distribution DG(σ²) outputs a vector whose coefficients are drawn from the discrete Gaussian distribution with variance σ². For a rational number 0 ≤ ρ ≤ 1, the distribution ZO(ρ) draws each entry from {0, ±1}, where ρ/2 is the probability of selecting +1 and of selecting −1, and 1 − ρ is the probability of selecting 0.

3.2. Learning with Errors over Rings. In 2010, Lyubashevsky et al. [25] first proposed the RLWE assumption, which is described as follows.
Definition 1 (RLWE). The RLWE_{λ,q,χ} assumption is to distinguish two distributions, namely, (a, a·s + e) ∈ R_q × R_q and (a, c) ∈ Unif(R_q × R_q), where a ∈ R_q and s ∈ R_q, e is an error term, and Unif denotes the uniform distribution. Lyubashevsky et al. [25] proved that the security of the RLWE assumption relies on the hardness of problems over ideal lattices.
Cheon et al.'s approximate homomorphic encryption scheme, denoted AHE, consists of the following algorithms:
(i) AHE.KeyGen(1^λ, p, L): given the security parameter λ, an integer p, and a level L, this algorithm first sets q_l = p^l · q_0 for l = 1, ⋯, L, where q_0 is a fixed integer. It selects a power-of-two integer M = M(λ, q_L), an integer P = P(λ, q_L), and a rational number σ = σ(λ, q_L). Next, it chooses a vector s from HWT(h) and sets the secret key sk = (1, s). A ring element a is sampled from R_{q_L} and an error term e from DG(σ²); the public key is pk = (b, a) ∈ R²_{q_L}, where b = −a·s + e (mod q_L). Then, a ring element a′ is sampled from R_{P·q_L} and an error term e′ from DG(σ²); the evaluation key is evk = (b′, a′) ∈ R²_{P·q_L}, where b′ = −a′·s + e′ + P·s² (mod P·q_L).
(ii) AHE.Enc(pk, m): in order to encrypt a plaintext m, this algorithm samples v from ZO(0.5) and two error terms e_0 and e_1 from DG(σ²). m is encrypted as the ciphertext c = v·pk + (m + e_0, e_1) (mod q_L).
(iii) AHE.Dec(sk, c): a ciphertext c = (b, a) is decrypted as b + a·s (mod q_l).
(iv) AHE.Add(c′, c″): given two ciphertexts c′ = ([c′_0]_{q_l}, [c′_1]_{q_l}) and c″ = ([c″_0]_{q_l}, [c″_1]_{q_l}) under the same secret key, the additive ciphertext is c_add = (c′_0 + c″_0, c′_1 + c″_1) (mod q_l).
(v) AHE.Mul(evk, c′, c″): given evk and two ciphertexts c′ and c″ under the same secret key, this algorithm first computes (d_0, d_1, d_2) = (c′_0·c″_0, c′_0·c″_1 + c′_1·c″_0, c′_1·c″_1) (mod q_l). The multiplicative ciphertext is c_mul = (d_0, d_1) + ⌊P^{−1}·d_2·evk⌉ (mod q_l).
(vi) AHE.ReScale_{l→l′}(c): for a ciphertext c ∈ R²_{q_l} at level l, the new ciphertext is c′ = ⌊(q_{l′}/q_l)·c⌉ (mod q_{l′}), which keeps decryption correct while reducing the scale. Rescaling adds only a small noise term, and the noise of a multiplicative ciphertext must stay below a bound determined by P, q_l, σ, and the ring dimension. The details of the noise analysis of Cheon et al.'s scheme can be found in [16].

Asynchronous Advantage Actor-Critic Reinforcement Learning Algorithm. In 2016, Mnih et al. [20] proposed the asynchronous advantage actor-critic reinforcement learning algorithm, which combines the value-based method and the policy-based method. One advantage of the A3C algorithm is that it can work in discrete as well as continuous action spaces. In addition, in order to improve learning efficiency, multiple asynchronous actor-learners, which interact with the environment and acquire various independent exploration policies, run in parallel. The details of the A3C algorithm are described as follows.
In the A3C algorithm, there are a policy function π(a_t | s_t; θ) and a value function V(s_t; θ_v), where a_t denotes an action at time step t, s_t denotes a state at time step t, and θ and θ_v are two parameter vectors. V(s_t; θ_v) and π(a_t | s_t; θ) are updated at most t_max times, where t_max denotes the maximum step. They are usually approximated by a single convolutional neural network: V(s_t; θ_v) is based on a linear output layer, and π(a_t | s_t; θ) on a softmax output layer. Namely, V(s_t; θ_v) = x(s_t)·θ_v, where x(s_t) is a feature function of s_t, and π(a_t | s_t; θ) = e^{f(a_t|s_t)·θ} / Σ_{j=0}^{t_max} e^{f(a_j|s_j)·θ}, where a_j is an action at time step j and f(a_j | s_j) is a feature function of a_j and s_j.
Furthermore, the A3C algorithm uses two loss functions, namely, a policy loss function and a value loss function, which are described as follows. On the one hand, the policy loss function is
f_π(θ) = ln π(a_t | s_t; θ)·(R − V(s_t; θ_v)) + β·H(π(s_t; θ)),
where R = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) is the reward, the parameter k depends on the state, and the upper bound of k is t_max. r_{t+i} is the immediate reward, and the discount factor γ ∈ (0, 1]. The entropy function H(π(s_t; θ)) can be set as −Σ_{i=0}^{k} f(a_t | s_t)·θ·ln π(s_t; θ). The hyperparameter β adjusts the intensity of the entropy regularization term. Hence, the differentiation of f_π(θ) with respect to θ is
∂f_π(θ)/∂θ = (∂ ln π(a_t | s_t; θ)/∂θ)·(R − V(s_t; θ_v)) + β·∂H(π(s_t; θ))/∂θ.
On the other hand, the value loss function is
f_v(θ_v) = (R − V(s_t; θ_v))².
Based on the above two loss functions and their differentiation, the A3C reinforcement learning algorithm is defined in Algorithm 1 and described as follows. Algorithm 1 requires input parameters θ, θ_v, θ′, θ′_v, T, t, t_max, T_max, t_g, η, W, and α, whose definitions are shown in Table 1. We first set T = 0 and t = 1. While T < T_max and w ∈ [1, W], we implement the following iteration. Global gradients dθ and dθ_v are set to 0, and θ′ and θ′_v are synchronized to θ and θ_v, respectively. We set t_0 = t and obtain the system state S_t ∈ S, where S is a state set, S = (S_0, ⋯, S_{φ−1}), and φ is the number of states. Next, we repeat a subalgorithm until t − t_0 = t_max. In this subalgorithm, the action A_t ∈ A is obtained by using π(A_t | S_t; θ′), where A is an action set, A = (A_0, ⋯, A_{χ−1}), and χ is the number of actions. We execute A_t, get the reward R_t, and observe the next state S_{t+1}. In addition, we set t = t + 1. After the implementation of the above subalgorithm, we check whether t mod t_g equals 0.
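A plaintext, single-worker sketch of one actor-critic update with the linear value function and softmax policy defined above may make the loss gradients concrete (entirely our own toy with scalar θ and θ_v: the feature values, return R, and learning rate `eta` are illustrative assumptions, and the entropy term is omitted for brevity):

```python
import math

def softmax_policy(theta, feats):
    """pi(a|s) proportional to exp(f(a|s) * theta), mirroring the softmax layer."""
    scores = [math.exp(f * theta) for f in feats]
    z = sum(scores)
    return [s / z for s in scores]

def a2c_step(theta, theta_v, x_s, feats, a, R, eta=0.1):
    """One plaintext advantage actor-critic update with scalar parameters."""
    V = x_s * theta_v                          # linear value function V(s) = x(s) * theta_v
    adv = R - V                                # advantage estimate R - V(s)
    pi = softmax_policy(theta, feats)
    # d/dtheta of ln pi(a|s) for a softmax over the scores f(a'|s) * theta
    grad_log_pi = feats[a] - sum(p * f for p, f in zip(pi, feats))
    theta = theta + eta * grad_log_pi * adv    # ascend the policy objective
    theta_v = theta_v + eta * 2 * adv * x_s    # descend the value loss (R - V)^2
    return theta, theta_v

a2c_step(theta=0.5, theta_v=0.5, x_s=1.0, feats=[0.2, 0.8], a=1, R=1.0)
```

With a positive advantage and the higher-scoring action chosen, θ moves up (the chosen action becomes more likely) and θ_v moves toward the observed return.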

Secure Dynamic Treatment Regimes on Health Data

4.1. System Model. As shown in Figure 1, the system model of secure dynamic treatment regimes on health data consists of four parts, namely, the undiagnosed patient, the key generation center, cloud servers, and historical data owners, which are described as follows: (i) The undiagnosed patient's current state is collected by using wearable devices, which integrate physiological sensor, lightweight computation, and communication modules. Wearable devices include smart bracelets, smart glasses, sleep monitoring sensors, and smart watches. They can collect a variety of health data, such as body temperature, heart rate, blood sugar, and blood volume index. Then, these health data are encrypted and transmitted to the cloud servers.

Algorithm 1: A3C reinforcement learning algorithm.

In this paper, we suppose that the entities in the system model are honest-but-curious. Namely, the entities strictly follow the designed protocols, but they are interested in acquiring the medical data of other entities. We suppose that there is an adversary A*_1 in the attack model.
The goal of A * 1 is to guess the plaintexts of the challenge historical data owners' ciphertexts or the challenge wearable devices' ciphertexts.
In order to acquire the ciphertexts of historical data owners and wearable devices, as well as the intermediate ciphertext results produced during the execution of the privacy-preserving A3C reinforcement learning algorithm and the treatment decision-making algorithm (Section 6), A*_1 eavesdrops on the communication links among the entities in the system model. However, these ciphertexts are produced by Cheon et al.'s approximate homomorphic encryption scheme [16]. Hence, A*_1 cannot decrypt these ciphertexts without knowing the corresponding secret keys, which is guaranteed by the semantic security of Cheon et al.'s scheme. In addition, the key generation center distributes key pairs to historical data owners and wearable devices in a secure way. Furthermore, lacking the private keys behind these ciphertexts, A*_1 cannot generate evaluation keys, so A*_1 cannot transform these ciphertexts into a domain that it can decrypt. Besides, A*_1 cannot get useful information by adding or multiplying a plaintext with these ciphertexts. In conclusion, the proposed model is secure.

4.2. System Setup and Overview. Our secure model of dynamic treatment regimes consists of two phases, which are described as follows.
(i) Training dataset outsourcing and initialization: historical data owners initialize the input parameters θ, θ_v, the learning rate η, and the discount factor γ. The state set S = (S_0, ⋯, S_{φ−1}) and action set A = (A_0, ⋯, A_{χ−1}) are encrypted as c_S = (c_{S_0}, ⋯, c_{S_{φ−1}}) and c_A = (c_{A_0}, ⋯, c_{A_{χ−1}}). Then, the historical data owners, of whom there are n (indexed j = 0, ⋯, n − 1), send c_S, c_A, η, γ, and other parameters to the cloud servers for storage and computation. (ii) Outsourced sequential treatment decision making: in order to achieve sequential treatment decision making, the undiagnosed patient's current state x, which comes from wearable devices, is encrypted as c_x. Then, c_x is transmitted to the cloud servers for treatment decision making. Based on our privacy-preserving A3C reinforcement learning algorithm and treatment decision-making algorithm, the cloud servers output the encrypted treatment decision c_a. The undiagnosed patient decrypts c_a to obtain the treatment decision a by using his own secret key.

5.1. Encoding Rational Numbers.
In order to implement the privacy-preserving A3C reinforcement learning algorithm, we need to encrypt health data, which are usually rational numbers. However, most homomorphic encryption schemes only support homomorphic operations over integers and cannot cope with rational numbers. Hence, in this paper, we adopt Cheon et al.'s encoding technique [16], which encodes a rational number so that it can be converted to a ring element, just as with the integer encoding technique. We can then use Cheon et al.'s scheme [16] to encrypt the converted result.
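The idea can be illustrated with a plain fixed-point sketch (our simplification, with no encryption involved; the scaling factor mirrors the paper's later choice of log p = 30):

```python
P = 2 ** 30  # scaling factor, matching log p = 30 used in the experiments

def encode(x):
    """Encode a rational number as the nearest scaled integer."""
    return round(x * P)

def decode(n):
    return n / P

def mul_rescale(n0, n1):
    """Multiplying two encodings doubles the scale; rescale back down by P,
    mirroring the AHE.ReScale step after a homomorphic multiplication."""
    return round(n0 * n1 / P)

decode(mul_rescale(encode(3.14), encode(0.5)))  # approximately 1.57
```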

5.2. Homomorphic Comparison Protocol.
In order to implement comparison for our secure computation algorithms, we design a homomorphic comparison protocol by using Cheon et al.'s scheme [16] and Sun et al.'s method [35]. We suppose that the user owns plaintexts m_0 and m_1 and encrypts them with Cheon et al.'s scheme into the ciphertexts c_0 and c_1, respectively. The user owns the secret key sk. The cloud server is responsible for storing the ciphertexts. As shown in Algorithm 2, the cloud server first computes the ciphertext c_b = t + c_0 − c_1, where t is the plaintext modulus. Next, the cloud server transmits c_b to the user. The user uses sk to decrypt c_b; the decryption result b = t + m_0 − m_1 reveals the ordering: if b ≥ t, then m_0 ≥ m_1; otherwise, m_0 < m_1.
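The arithmetic behind the protocol can be restaged on plaintexts (our own sketch; `T` stands in for the plaintext modulus t, and we assume 0 ≤ m_0, m_1 < t so the shifted difference never wraps):

```python
T = 1 << 16  # stand-in for the plaintext modulus t

def server_blind_diff(m0, m1):
    """What the cloud server computes under encryption: b = t + m0 - m1."""
    return T + m0 - m1

def user_compare(b):
    """After decrypting b, the user learns only the ordering of m0 and m1."""
    return "m0 >= m1" if b >= T else "m0 < m1"

user_compare(server_blind_diff(7, 3))  # "m0 >= m1"
user_compare(server_blind_diff(2, 9))  # "m0 < m1"
```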

5.3. Homomorphic Maximum Protocol.
In order to compute the encrypted index of the largest plaintext, we design a homomorphic maximum protocol by using the above protocol and Sun et al.'s method [35]. We suppose that the user owns m_0, ⋯, m_{k−1}, where k is the number of plaintexts, and encrypts them with Cheon et al.'s scheme [16] into the ciphertexts c_0, ⋯, c_{k−1}, respectively. The user owns the secret key sk. As shown in Algorithm 3, the cloud server sets c_max = c_0 and then compares c_max with each c_i by using the homomorphic comparison protocol, updating c_max whenever the plaintext of c_i is larger; it continues until i > k − 1. Finally, the user can obtain the index of the largest plaintext by decrypting c_max. For example, suppose there exist ciphertexts c_2, c_3, and c_4, whose plaintexts are 2, 3, and 4, respectively. We set c_max = c_2. Next, we compare c_max and c_3, and c_max is updated to c_3. Then, we compare c_max and c_4, and c_max is updated to c_4. After the decryption of c_max, the user gets the maximum result 4.
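The running-maximum logic of Algorithm 3 can be sketched in plaintext (our restaging; each `>` comparison stands in for one round of the homomorphic comparison protocol):

```python
def argmax_sketch(values):
    """Keep a running maximum, updating the index whenever a larger value is seen."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:  # one homomorphic comparison round
            best = i
    return best, values[best]

argmax_sketch([2, 3, 4])  # (2, 4), matching the worked example
```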

5.4. Homomorphic Exponential Protocol.
In this section, based on the Taylor series, we describe the homomorphic exponential protocol. We suppose that the user owns the plaintext m, which is encrypted as c_m by using Cheon et al.'s homomorphic encryption scheme. Only the user has the secret key. Then, c_m is stored on the cloud server. In the homomorphic exponential protocol (Algorithm 4), the cloud server first computes the ciphertext c_{e^m} = 1 + c_m + c_m²/2! + c_m³/3! + ⋯ + c_m^n/n! without decryption, where n is an integer; the precision of e^m increases with n. Then, c_{e^m} is returned to the user, who obtains the exponential result e^m by using his secret key. For example, we can set m = 2 and n = 3, and then c_{e^2} = 1 + c_2 + c_2²/2! + c_2³/3!, where c_{e^2} and c_2 are ciphertexts of (an approximation of) e^2 and of 2, respectively. After the decryption of c_{e^2}, the user gets the approximate exponential result e^2.
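The truncated Taylor series the cloud server evaluates homomorphically can be checked in plaintext (our sketch; `n` controls the precision exactly as in the protocol):

```python
from math import exp, factorial

def taylor_exp(m, n):
    """Truncated Taylor series 1 + m + m^2/2! + ... + m^n/n!."""
    return sum(m ** i / factorial(i) for i in range(n + 1))

taylor_exp(2, 3)   # 6.333..., a rough approximation of exp(2) = 7.389...
taylor_exp(2, 15)  # much closer: precision grows with n
```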

5.5. Homomorphic Division Protocol.
In this section, we describe the homomorphic division protocol. We suppose that the user owns plaintexts m_0, m_1, and m_2, which are encrypted as c_{m_0}, c_{m_1}, and c_{m_2} by using Cheon et al.'s homomorphic encryption scheme. Only the user has the secret key. Then, c_{m_0}, c_{m_1}, and c_{m_2} are transmitted to the cloud server. In order to output the ciphertext c_div of the plaintext m_2/(m_0 + m_1), we design the homomorphic division protocol (Algorithm 5), which works as follows. The cloud server first computes the ciphertext c_add = c_{m_0} + c_{m_1} without decryption, where the plaintext of c_add is add = m_0 + m_1. Then, c_add is returned to the user, who obtains the plaintext add with his secret key and calculates rev = 1/add. rev is encrypted as c_rev by using Cheon et al.'s scheme and transmitted to the cloud server, which calculates the ciphertext c_div = c_{m_2} · c_rev. Finally, c_div is returned to the user. After the decryption of c_div, the user gets the division result div = m_2/(m_0 + m_1). For example, we set m_0 = 2, m_1 = 1, and m_2 = 6. The cloud server calculates c_add = c_{m_0} + c_{m_1} = c_3, where the plaintext of c_3 is 3. c_3 is returned to the user, who calculates rev = 1/3 ≈ 0.33. The ciphertext c_rev is sent to the cloud server, which calculates c_div = c_{m_2} · c_rev = c_{0.33×6} = c_{1.98}, where the plaintext of c_{1.98} is 1.98 ≈ 2.

Algorithm 3: Homomorphic maximum protocol.

5.6. Homomorphic Reciprocal of Square Root Protocol. In this section, we describe the homomorphic reciprocal of square root protocol. The traditional method first computes the ciphertext of the approximate square root and then computes its approximate reciprocal homomorphically. However, two approximate computations degrade the precision of the final result. Hence, based on Lomont's fast inverse square root algorithm [36], we design a new homomorphic reciprocal of square root protocol (Algorithm 6), which needs only one approximate computation.
In our protocol, we suppose that the user owns the floating-point number m = (1 + m_0)·2^{m_1 − 127}, where 0 < m_0 < 1 and 0 < m_1 < 255. It is encrypted as c_m by using Cheon et al.'s homomorphic encryption scheme. Only the user has the secret key, and c_m is stored on the cloud server. c_m is first transmitted to the user, who decrypts it with his secret key. The decryption result m is converted to the integer m′ = m_1·2^23 + m_0·2^23, which is encrypted as c_{m′} by using Cheon et al.'s scheme. Next, c_{m′} is transmitted to the cloud server. The cloud server computes the intermediate ciphertext c_temp = 1597463007 − c_{m′}/2, where 1597463007 is the magic constant of Lomont's algorithm [36] and the plaintext of c_temp is temp. Then, the cloud server sends c_temp to the user. The user decrypts c_temp to obtain temp = temp_1·2^23 + temp_0·2^23, where 0 < temp_0 < 1 and 0 < temp_1 < 255. temp is converted to temp′ = (1 + temp_0)·2^{temp_1 − 127}, which is encrypted as c_{temp′} and transmitted to the cloud server. The cloud server computes the ciphertext c_{1/√m} = c_{temp′}·(1.5 − 0.5·c_m·c_{temp′}²). Then, c_{1/√m} is returned to the user, who obtains the reciprocal of the square root 1/√m by using his secret key. For example, we set m = (1 + 0.25)·2^{124 − 127} ≈ 0.156. Then, m is converted to m′ = 124·2^23 + 0.25·2^23, and the ciphertext c_{m′} of m′ is sent to the cloud server. The cloud server computes c_temp = 1597463007 − c_{m′}/2. The user decrypts c_temp to obtain temp ≈ 128·2^23 + 0.3075·2^23, which is converted to temp′ = (1 + 0.3075)·2^{128 − 127} = 2.615. The ciphertext c_{temp′} of temp′ is sent to the cloud server, which computes c_{1/√m} = c_{temp′}·(1.5 − 0.5·c_m·c_{temp′}²). After decryption, the user obtains 1/√m ≈ 2.53.
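The plaintext computation that Algorithm 6 distributes between user and server is Lomont's routine, which can be sketched as follows (our sketch; `struct` reinterprets the float's bits, playing the role of the user's integer conversion, and 0x5F3759DF is Lomont's constant 1597463007):

```python
import struct

MAGIC = 0x5F3759DF  # Lomont's constant, 1597463007

def fast_inv_sqrt(x):
    """Bit-level initial guess plus one Newton refinement, as in Algorithm 6."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]  # float bits as an integer
    i = MAGIC - (i >> 1)                              # the server-side guess
    y = struct.unpack("<f", struct.pack("<I", i))[0]  # reinterpret as a float
    return y * (1.5 - 0.5 * x * y * y)                # one Newton iteration

fast_inv_sqrt(0.15625)  # about 2.52, versus 1/sqrt(0.15625) = 2.5298...
```

One Newton iteration already brings the relative error below about 0.2%, which is why the protocol needs only a single approximate computation.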

Privacy-Preserving Computation Algorithms
6.1. Privacy-Preserving A3C Reinforcement Learning Algorithm. In this section, we describe how to implement a privacy-preserving A3C reinforcement learning algorithm by using Cheon et al.'s approximate homomorphic encryption scheme [16]. As shown in Algorithm 7, the privacy-preserving A3C reinforcement learning algorithm works as follows. Algorithm 7 requires input parameters θ, θ_v, θ′, θ′_v, T, t, t_max, T_max, t_g, η, W, α, c_S, and c_A. We set T = 0 and t = 1. While T < T_max and w ∈ [1, W], we implement the following iteration. dθ and dθ_v are set to 0, and θ′ and θ′_v are synchronized to θ and θ_v, respectively. We set t_0 = t and obtain the encrypted system state c_{S_t} ∈ c_S. Next, we repeat a subalgorithm until t − t_0 = t_max. In this subalgorithm, the encrypted action c_{A_t} ∈ c_A is obtained by computing argmax(c_{π_0}, c_{π_1}, ⋯, c_{π_{t_max}}) with the homomorphic maximum protocol, where c_{π_j} = e^{f(c_{A_j}|c_{S_j})·θ′}, j = 0, 1, ⋯, t_max. We execute c_{A_t}, get the encrypted reward c_{R_t}, and observe the next encrypted state c_{S_{t+1}} ∈ c_S. In addition, we set t = t + 1. After the implementation of the above subalgorithm, we observe whether c_{S_t} is terminal. Then, we repeat a subalgorithm from i = t − 1 down to i = t_0. In this subalgorithm, the encrypted return c_R is initialized according to whether c_{S_t} is terminal (0 if terminal, the ciphertext of V(S_t; θ′_v) otherwise) and accumulated as c_R = c_{R_i} + γ·c_R. The cloud server then computes the encrypted gradients c_{∂f_π(θ′)/∂θ′} and c_{∂f_v(θ′_v)/∂θ′_v} of the two loss functions of Section 3 homomorphically, using the proposed homomorphic exponential and division protocols. c_dθ is set to c_dθ + c_{∂f_π(θ′)/∂θ′}, and c_{dθ_v} is set to c_{dθ_v} + c_{∂f_v(θ′_v)/∂θ′_v}. The cloud server computes c_g = α·c_g + (1 − α)·(c_dθ)² and c_{g_v} = α·c_{g_v} + (1 − α)·(c_{dθ_v})². The cloud server sends c_dθ and c_{dθ_v} to the user, who decrypts them to obtain dθ and dθ_v, reencrypts them as c′_dθ and c′_{dθ_v}, and returns c′_dθ and c′_{dθ_v} to the cloud server. In order to reduce the depth of homomorphic multiplication for the calculation of c_g and c_{g_v}, the cloud server likewise sends c_g and c_{g_v} to the user, who decrypts them to obtain g and g_v, reencrypts them as c′_g and c′_{g_v}, and sends c′_g and c′_{g_v} back to the cloud server, which sets c_g = c′_g and c_{g_v} = c′_{g_v}. Finally, based on the above homomorphic reciprocal of square root protocol, θ and θ_v are updated by using
c_θ = θ − η·c_dθ/√(c_g + ε) and c_{θ_v} = θ_v − η·c_{dθ_v}/√(c_{g_v} + ε),
respectively, where ε is a small smoothing constant. After the execution of Algorithm 7, we obtain the encrypted optimized parameters c_θ and c_{θ_v}, which are returned to the user. The user decrypts c_θ and c_{θ_v} to obtain θ and θ_v, which can be used in the implementation of the secure treatment decision-making algorithm.

Algorithm 7: Privacy-preserving A3C reinforcement learning algorithm.

For example, set t = 2. Because S_2 is nonterminal, the cloud server initializes c_R with the ciphertext of V(S_2; θ′_v) and accumulates the encrypted return. It then computes the encrypted gradients c_{∂f_π(θ′)/∂θ′} and c_{∂f_v(θ′_v)/∂θ′_v}. With c_dθ = 0 and c_{dθ_v} = 0 initially, the cloud server computes c_dθ = c_dθ + c_{∂f_π(θ′)/∂θ′} and c_{dθ_v} = c_{dθ_v} + c_{∂f_v(θ′_v)/∂θ′_v}. With the help of the user, the cloud server gets the refreshed ciphertexts c_dθ and c_{dθ_v}. The cloud server computes c_g = α·c_g + (1 − α)·(c_dθ)² and c_{g_v} = α·c_{g_v} + (1 − α)·(c_{dθ_v})². After the cloud server obtains the refreshed ciphertexts c′_g and c′_{g_v}, we set c_g = c′_g and c_{g_v} = c′_{g_v}. With dθ = 0.0013 and g = 0.000000845, the cloud server computes
c_θ = θ − η·c_dθ/√(c_g + ε) = 0.5 − 0.1·0.0013/√(0.000000845 + 0.01) ≈ 0.4987,
and analogously for c_{θ_v}. The user obtains the refreshed θ ≈ 0.4987 and θ_v ≈ 0.4698 by using his secret key.
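The plaintext arithmetic of this update step can be verified directly (our sketch; α = 0.5 and ε = 0.01 are the values implied by the worked numbers above, and the encryption layer is omitted):

```python
from math import sqrt

def rmsprop_update(theta, d_theta, g, eta=0.1, alpha=0.5, eps=0.01):
    """Accumulate the squared gradient, then scale the step by the reciprocal
    square root that the protocol computes homomorphically."""
    g = alpha * g + (1 - alpha) * d_theta ** 2
    theta = theta - eta * d_theta / sqrt(g + eps)
    return theta, g

rmsprop_update(theta=0.5, d_theta=0.0013, g=0.0)  # (about 0.4987, 8.45e-07)
```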
6.2. Secure Treatment Decision-Making Algorithm. In this subsection, based on the above privacy-preserving A3C reinforcement learning algorithm, the secure treatment decision-making algorithm TDM(θ, θ_v, c_x, c_S, c_A) is implemented in Algorithm 8, which is described as follows. The input parameters include θ, θ_v, the undiagnosed patient's encrypted current state c_x, c_S, and c_A. Set the index col = 0, and initialize the ciphertexts c_{i,j}, where i = 0, ⋯, χ − 1 and j = 0, ⋯, φ − 1. Firstly, each element c_{S_j} of c_S is compared with c_x by using the above homomorphic comparison protocol, and col is set to the index of the matching state. The cloud server then computes the ciphertext c_v of V(S_col; θ_v) and the ciphertexts c_{π,i}, where the plaintext of c_{π,i} is π(A_i | S_col). The cloud server computes the homomorphic multiplication between c_v and c_{π,i}, namely, c_i, the ciphertext of V(S_col; θ_v)·π(A_i | S_col), for i = 0, ⋯, χ − 1. Set index = 0. Finally, the cloud server computes the ciphertext c_{hor,col} by using the homomorphic maximum protocol argmax(c_0, ⋯, c_{χ−1}) and sets index = hor. Hence, the treatment decision is c_dr = c_{A_index}. Then, c_dr is transmitted to the undiagnosed patient, who decrypts it by using his own secret key to obtain the treatment decision dr.

Algorithm 8: Secure treatment decision-making algorithm.

Table 3: The distribution of the encrypted probability (encrypted actions versus encrypted states).
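The plaintext decision rule that Algorithm 8 evaluates under encryption can be sketched as follows (entirely our toy: the state values, the feature scores standing in for f(A_i | S_col), and the parameters are hypothetical, and the homomorphic comparison and maximum protocols are replaced by plain operations):

```python
import math

def treatment_decision(theta, theta_v, x_state, states, action_feats):
    """Locate the patient's state, score each action by V(S; theta_v) * pi(A_i | S; theta),
    and return the index of the best action."""
    col = states.index(x_state)                    # homomorphic comparison in the paper
    V = states[col] * theta_v                      # linear value V(S_col; theta_v)
    scores = [math.exp(f * theta) for f in action_feats[col]]
    z = sum(scores)
    products = [V * s / z for s in scores]         # V(S_col) * pi(A_i | S_col)
    return max(range(len(products)), key=products.__getitem__)  # homomorphic maximum

treatment_decision(0.5, 0.47, x_state=2.0,
                   states=[1.0, 2.0, 3.0],
                   action_feats=[[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]])
```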

Performance Results
In this section, based on Cheon et al.'s homomorphic encryption scheme, we analyze the efficiency of our secure computation protocols, the secure A3C reinforcement learning algorithm, and the secure treatment decision-making algorithm. We implement the experiments in a virtual machine without a GPU hardware platform. The host operating system is macOS 10.14.6, and the personal computer has two Intel(R) Core(TM) i5 CPU processors running at 2.3 GHz with 8.00 GB RAM. The virtual machine runs Ubuntu 16.04 and is allocated a single Intel(R) Core(TM) i5 CPU processor with 1.0 GB RAM. To implement high-level numeric algorithms, we choose the NTL library, and we compile our C++ code with GCC. We adopt the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php) for the experiments. For convenience, we set log q ranging from 400 to 700 and the scaling factor log p = 30. Figure 2 shows the efficiency of our homomorphic comparison protocol, where the number of comparisons oc ranges from 1 to 4. As shown in Figure 2, the running time of our homomorphic comparison protocol increases significantly as oc grows. The following figures show the efficiency of our homomorphic maximum protocol, where the number of maximum operations om ranges from 1 to 4 and the number of plaintexts k ranges from 5 to 7; it can be easily observed that its running time increases significantly as k and om grow. Figures 6-8 show the efficiency of our homomorphic exponential protocol, where the number of exponential operations oe ranges from 1 to 4 and the integer n ranges from 2 to 4; its running time increases rapidly as oe and n grow. Then, Figure 9 shows the efficiency of our homomorphic division protocol, where the number of divisions od ranges from 1 to 4.
The running time of our homomorphic division protocol grows markedly as od and log q increase. Figure 10 shows the efficiency of our homomorphic reciprocal of square root protocol, where os, the number of reciprocal-of-square-root operations, ranges from 1 to 4. As os and log q increase, more running time is needed; this protocol runs longer than the homomorphic comparison, maximum, exponential, and division protocols above. Figure 11 shows the efficiency of our secure A3C reinforcement learning algorithm, where ot, the number of A3C training operations, ranges from 1 to 4. As ot and log q increase, the algorithm requires more running time. Because this algorithm is responsible for training the parameters θ and θ_v, it is complicated, and its considerable running time confirms this. Figure 12 shows the efficiency of our secure treatment decision-making algorithm, where dm, the number of treatment decision-making operations, ranges from 1 to 4. Its running time grows as dm increases. Since this algorithm only uses the already optimized θ and θ_v, it is less complicated than the secure A3C algorithm, and its shorter running time verifies this. In conclusion, the above efficiency analysis shows the feasibility of our secure computation protocols and algorithms.

Conclusion
Reinforcement learning is helpful for implementing dynamic treatment regimes on health data. However, private health data may be illegally leaked, falsified, or deleted during the execution of the reinforcement learning algorithm. Hence, we study secure dynamic treatment regimes on health data. In this paper, we have designed a homomorphic comparison protocol, a homomorphic maximum protocol, a homomorphic exponential protocol, a homomorphic division protocol, and a homomorphic reciprocal of square root protocol. Based on these secure computation protocols, we have proposed a privacy-preserving A3C reinforcement learning algorithm for the first time, which is then used to implement the secure treatment decision-making algorithm. Finally, we simulate the proposed secure computation protocols and algorithms. Simulation results show that our secure computation protocols and algorithms are feasible.
In future research, we will use homomorphic encryption to implement other machine learning algorithms, such as distributed learning [37] and federated reinforcement learning [38], which can successfully control multiple real devices of the same type with slightly different dynamics. In addition, we plan to evaluate the performance of the secure A3C algorithm in other real-world scenarios, such as vehicular ad hoc networks.

Data Availability
The data of secure computation protocols and algorithms used to support the findings of this study are available from the corresponding author upon request.