Paper 2024/759

Enhancing Watermarked Language Models to Identify Users

Aloni Cohen, University of Chicago
Alexander Hoover, University of Chicago
Gabe Schoenbach, University of Chicago
Abstract

A zero-bit watermarked language model produces text that is indistinguishable from that of the underlying model, but which can be detected as machine-generated using a secret key. Unfortunately, merely detecting that (say) AI-generated spam is watermarked may not prevent future abuses. If we could additionally trace the text to a spammer's API token or account, we could then cut off their access or pursue legal action.

We introduce multi-user watermarks, which allow tracing model-generated text to individual users or to groups of colluding users. We construct multi-user watermarking schemes from undetectable zero-bit watermarking schemes. Importantly, our schemes provide both zero-bit and multi-user assurances at the same time: they detect shorter snippets just as well as the original scheme, and they trace longer excerpts to individuals. Along the way, we give a generic construction of a watermarking scheme that embeds long messages into generated text. Ours are the first black-box reductions between watermarking schemes for language models.

A major challenge for black-box reductions is the lack of a unified abstraction for robustness: the guarantee that marked text remains detectable even after edits. Existing works give incomparable robustness guarantees, based on bespoke requirements on the language model's outputs and the users' edits. To overcome this challenge, we introduce a new abstraction called AEB-robustness. AEB-robustness guarantees that the watermark is detectable whenever the edited text "approximates enough blocks" of model-generated output; specifying the robustness condition amounts to defining approximates, enough, and blocks.

Using our new abstraction, we relate the robustness properties of our message-embedding and multi-user schemes to those of the underlying zero-bit scheme, in a black-box way. Whereas prior works only guarantee robustness for a single text generated in response to a single prompt, our schemes are robust against adaptive prompting, a stronger and more natural adversarial model.
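The abstract describes two generic constructions: embedding a long message by selecting among zero-bit keys, and tracing users by issuing each user a fingerprinting-code codeword as their message. The Python sketch below illustrates these two ideas at a toy level. All interfaces here (ZeroBitScheme, MessageEmbeddingScheme, trace_user, per-response block boundaries, exact-match tracing) are illustrative assumptions, not the paper's actual constructions or API.

# Illustrative sketch only: a toy message-embedding watermark built from an
# assumed zero-bit interface, plus toy user tracing. All names and interfaces
# here are hypothetical.

from dataclasses import dataclass
from typing import Optional, Protocol


class ZeroBitScheme(Protocol):
    def keygen(self) -> bytes: ...
    def generate(self, key: bytes, prompt: str) -> str: ...
    def detect(self, key: bytes, text: str) -> bool: ...


@dataclass
class MessageEmbeddingScheme:
    """Embed an m-bit message: bit i selects which of two independent
    zero-bit keys is used to mark the i-th block of generated text."""

    zb: ZeroBitScheme
    m: int  # message length in bits

    def keygen(self) -> list[tuple[bytes, bytes]]:
        return [(self.zb.keygen(), self.zb.keygen()) for _ in range(self.m)]

    def generate(self, keys, prompt: str, message: list[int]) -> str:
        # Mark block i under keys[i][message[i]]; treating each response
        # chunk as one block is an assumption of this sketch.
        blocks = [self.zb.generate(keys[i][bit], prompt)
                  for i, bit in enumerate(message)]
        return " ".join(blocks)

    def decode(self, keys, text: str) -> list[Optional[int]]:
        # Try both keys at each position; None marks an unreadable
        # position (e.g., a block that was edited away).
        word = []
        for k0, k1 in keys:
            b0, b1 = self.zb.detect(k0, text), self.zb.detect(k1, text)
            word.append(0 if b0 and not b1 else 1 if b1 and not b0 else None)
        return word


def trace_user(scheme, keys, codebook, text):
    # Toy multi-user tracing: each user is issued a codeword from a
    # fingerprinting code as their message; decode the text and match it
    # against the codebook. A real fingerprinting code's tracing algorithm
    # also identifies members of a colluding group; exact matching does not.
    word = scheme.decode(keys, text)
    for user, codeword in codebook.items():
        if all(w is None or w == c for w, c in zip(word, codeword)):
            return user
    return None

In this toy version, a short snippet containing even one block can still be flagged by running the zero-bit detector under every key, while longer excerpts reveal enough of the codeword to trace a user. The paper's actual schemes achieve the analogous guarantees with concrete AEB-robustness conditions.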

Metadata
Available format(s)
PDF
Category
Applications
Publication info
Preprint.
Keywords
watermarking, language models, generative AI, fingerprinting codes
Contact author(s)
aloni@g.uchicago.edu
alexhoover@uchicago.edu
gschoenbach@uchicago.edu
History
2024-05-20: approved
2024-05-17: received
Short URL
https://ia.cr/2024/759
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2024/759,
      author = {Aloni Cohen and Alexander Hoover and Gabe Schoenbach},
      title = {Enhancing Watermarked Language Models to Identify Users},
      howpublished = {Cryptology ePrint Archive, Paper 2024/759},
      year = {2024},
      note = {\url{https://eprint.iacr.org/2024/759}},
      url = {https://eprint.iacr.org/2024/759}
}