How Many Topics? Stability Analysis for Topic Models

Authors: 

Derek Greene, Derek O'Callaghan, Pádraig Cunningham

Publication Type: 
Refereed Conference Meeting Proceeding
Abstract: 
Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the “over-clustering” of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
Conference Name: 
European Conference on Machine Learning (ECML'14)
Proceedings: 
European Conference on Machine Learning (ECML'14)
Digital Object Identifier (DOI): 
10.na
Publication Date: 
15/09/2014
Conference Location: 
France
Institution: 
National University of Ireland, Dublin (UCD)
Open access repository: 
No