## Cryptology ePrint Archive: Report 2021/099

Property Inference from Poisoning

Melissa Chase and Esha Ghosh and Saeed Mahloujifar

Abstract: A major concern in training and releasing machine learning models is to what extent the model contains sensitive information that the data holders do not want to reveal. Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data. In this work, we study property inference in scenarios where the adversary can maliciously control part of the training data (poisoning data) with the goal of increasing the leakage.

Previous work on poisoning attacks focused on trying to decrease the accuracy of models either on the whole population or on specific sub-populations or instances. Here, for the first time, we study poisoning attacks where the goal of the adversary is to increase the information leakage of the model. Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications where some of the data sources may be malicious.

We first describe our property inference poisoning attack that allows the adversary to learn the prevalence in the training data of any property it chooses: it chooses the property to attack, then submits input data according to a poisoned distribution, and finally uses black box queries (label-only queries) on the trained model to determine the frequency of the chosen property. We theoretically prove that our attack can always succeed as long as the learning algorithm used has good generalization properties.

We then verify effectiveness of our attack by experimentally evaluating it on two datasets: a Census dataset and the Enron email dataset. In the first case we show that classifiers that recognizes whether an individual has high income (Census data) also leak information about the race and gender ratios of the underlying dataset. In the second case, we show classifiers trained to detect spam emails (Enron data) can also reveal the fraction of emails which show negative sentiment (according to a sentiment analysis algorithm); note that the sentiment is not a feature in the training dataset, but rather some feature that the adversary chooses and can be derived from the existing features (in this case the words). Finally, we add an additional feature to each dataset that is chosen at random, independent of the other features, and show that the classifiers can also be made to leak statistics about this feature; this shows that the attack can target features completely uncorrelated with the original training task. We were able to achieve above $90\%$ attack accuracy with $9-10\%$ poisoning in all of these experiments.

Category / Keywords: applications / Poisoning Attack, Property Inference Attack, Machine Learning, Information Leakage