Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d / \varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U, \mathcal{R})$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is \emph{negative}: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack.
However, this attack is ``theoretical only'', requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |\mathcal{R}|$. That is, the Bernoulli and reservoir sampling algorithms with sample size $\Omega(\log |\mathcal{R}|/\varepsilon^2)$ output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.
Category / Keywords: foundations / random sampling, adversarial model, epsilon-approximation, streaming algorithm Date: received 30 Jun 2019 Contact author: eylony at gmail com Available format(s): PDF | BibTeX Citation Version: 20190702:142415 (All versions of this report) Short URL: ia.cr/2019/764