Crosslingual Topic Modeling with WikiPDA

Tiziano Piccardi, Robert West

Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA) is a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation showed that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. WikiPDA also has the capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning.
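The bag-of-links idea can be illustrated with a minimal sketch. It assumes a toy lookup table from (language, article title) pairs to Wikidata identifiers; the names `TITLE_TO_QID` and `bag_of_links` are hypothetical and not taken from the WikiPDA code. Only the language-independent representation is shown here. The matrix-completion densification and the topic-model training are omitted.

```python
from collections import Counter

# Hypothetical toy lookup: (language, article title) -> Wikidata concept ID.
# A real pipeline would obtain this mapping from Wikidata sitelinks.
TITLE_TO_QID = {
    ("en", "Italy"): "Q38",
    ("it", "Italia"): "Q38",
    ("en", "Paris"): "Q90",
    ("it", "Parigi"): "Q90",
}


def bag_of_links(lang, link_titles):
    """Turn an article's outgoing link titles into a bag of Wikidata IDs.

    Links with no known Wikidata concept are skipped (an assumption of
    this sketch, not necessarily what WikiPDA does).
    """
    bag = Counter()
    for title in link_titles:
        qid = TITLE_TO_QID.get((lang, title))
        if qid is not None:
            bag[qid] += 1
    return bag


# The same article in English and in Italian links to the same concepts,
# so both versions end up with an identical representation.
english = bag_of_links("en", ["Italy", "Paris", "Italy", "Unmapped page"])
italian = bag_of_links("it", ["Italia", "Parigi", "Italia"])
```

Because both bags are keyed by Wikidata IDs rather than by words, a single topic model trained on such bags can be applied to articles in any language that shares the same concept vocabulary.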

