Paper 2024/659

Secure Latent Dirichlet Allocation

Thijs Veugen, TNO, University of Twente
Vincent Dunning, TNO
Michiel Marcus, TNO
Bart Kamphorst, TNO
Abstract

Topic modelling refers to a popular set of techniques used to discover hidden topics that occur in a collection of documents. These topics can, for example, be used to categorise documents or label text for further processing. One popular topic modelling technique is Latent Dirichlet Allocation (LDA). In topic modelling scenarios, the documents are often assumed to be in one centralised dataset. However, sometimes documents are held by different parties, and contain privacy-sensitive or commercially sensitive information that cannot be shared. We present a novel, decentralised approach to train an LDA model securely without having to share any information about the content of the documents with the other parties. We preserve the privacy of the individual parties using a combination of privacy-enhancing technologies. We show that our decentralised, privacy-preserving LDA solution achieves accuracy similar to that of an (insecure) centralised approach. With $1024$-bit Paillier keys, a topic model with $5$ topics and $3000$ words can be trained in around $16$ hours. Furthermore, we show that the solution scales linearly in the total number of words and the number of topics.
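The abstract mentions Paillier keys; the Paillier cryptosystem is additively homomorphic, which is what allows parties to aggregate encrypted counts without revealing them. The following is a minimal textbook sketch of Paillier encryption with toy key sizes, not the paper's implementation (real deployments would use $1024$-bit or larger moduli and a vetted library):

```python
import math
import secrets

def keygen(p, q):
    # Toy Paillier key generation from two primes (real keys use >=1024-bit n).
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    # With generator g = n + 1, the decryption constant mu is lam^{-1} mod n.
    mu = pow(lam, -1, n)
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    # Random r in [1, n-1] with gcd(r, n) = 1.
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    # Enc(m) = (n+1)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    # L(x) = (x - 1) / n; the division is exact by construction.
    l = (pow(c, lam, n2) - 1) // n
    return (l * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts mod n,
# so encrypted per-party counts can be summed without decryption.
pub, priv = keygen(347, 359)  # toy primes for illustration only
c = (encrypt(pub, 12) * encrypt(pub, 30)) % (pub[0] ** 2)
assert decrypt(priv, c) == 42
```

The homomorphic sum is the key property: each party encrypts its local word counts, the ciphertexts are multiplied together, and only the aggregate is ever decrypted.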

Metadata
Available format(s)
PDF
Category
Applications
Publication info
Preprint.
Keywords
Latent Dirichlet Allocation, Secure multi-party computation
Contact author(s)
thijs.veugen@tno.nl
vincent.dunning@tno.nl
michiel.marcus@tno.nl
bart.kamphorst@tno.nl
History
2024-05-02: approved
2024-04-29: received
Short URL
https://ia.cr/2024/659
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2024/659,
      author = {Thijs Veugen and Vincent Dunning and Michiel Marcus and Bart Kamphorst},
      title = {Secure Latent Dirichlet Allocation},
      howpublished = {Cryptology {ePrint} Archive, Paper 2024/659},
      year = {2024},
      url = {https://eprint.iacr.org/2024/659}
}