Introduction
We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy.
We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation.
The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs).
For details, please read our arXiv report.
Publication
Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.
Topic Modeling of Hierarchical Corpora.
Submitted for publication.
[arXiv]
Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.
A Variational Approximation for Topic Modeling of Hierarchical Corpora.
In Proceedings of the 30th International Conference on Machine Learning (ICML 2013). Atlanta, GA.
[paper]
[supplement]
[bib]
People
Source code
You can find our implementation on GitHub.
Data sets
These are the data sets we used in the paper.
Flat corpora:
[KOS]
[Enron]
[SubsetOfNYTimes]
Hierarchical corpora:
[NIPS]
[Freelancer]
[BlackHatWorld]
Format of the data sets
The data sets above are represented in the following format.
First, the hierarchies in corpora are described in 'hier_tree_structures.txt'.
The first line of the file has the number of categories including the root node.
Then, each subsequent line describes an individual category.
These lines are composed as:
[category_id]\t[category_id_of_first_child] [category_id_of_second_child] ... -1\n
where -1 terminates the list of children and the category id of 0 is reserved for the root node.
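As an illustration of the line format above, here is a minimal Python sketch of a reader for 'hier_tree_structures.txt'. It is not part of our released implementation; the function name parse_hierarchy is hypothetical, and the sketch assumes that every category (including leaves, whose child list would then be just -1) has its own line.

```python
def parse_hierarchy(text):
    """Parse the contents of 'hier_tree_structures.txt'.

    Returns (num_categories, children), where children maps each
    category id to the list of its child category ids.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    num_categories = int(lines[0])  # first line: number of categories, root included
    children = {}
    for line in lines[1:]:
        tokens = line.split()  # splits on the tab and on the spaces
        category_id = int(tokens[0])
        child_ids = []
        for tok in tokens[1:]:
            value = int(tok)
            if value == -1:  # -1 marks the end of the child list
                break
            child_ids.append(value)
        children[category_id] = child_ids
    return num_categories, children


# Made-up example: root 0 has children 1 and 2; category 1 has child 3.
sample = "4\n0\t1 2 -1\n1\t3 -1\n2\t-1\n3\t-1\n"
num_categories, children = parse_hierarchy(sample)
print(children)  # {0: [1, 2], 1: [3], 2: [], 3: []}
```

Comparing num_categories with len(children) is a quick sanity check that the file was read completely.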
Second, each category has documents attached to it, and this mapping is given in 'category_to_docids.txt'.
Each line is composed as:
[document_id1] [document_id2] ... -1\n
where the document ids are 0-based.
The first line corresponds to the root node (i.e. the category id of 0) and the second line corresponds to the category id of 1 and so on.
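The mapping described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name parse_category_docs is hypothetical, and a category with no attached documents is assumed to appear as a line containing only -1, so that line positions stay aligned with category ids.

```python
def parse_category_docs(text):
    """Parse the contents of 'category_to_docids.txt'.

    Returns a list whose i-th entry holds the 0-based document ids
    attached to category i (entry 0 is the root node).
    """
    docs_by_category = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines occur only as trailing whitespace
        doc_ids = []
        for tok in line.split():
            value = int(tok)
            if value == -1:  # -1 marks the end of the document list
                break
            doc_ids.append(value)
        docs_by_category.append(doc_ids)
    return docs_by_category


# Made-up example: the root has no documents, category 1 has two, category 2 has one.
sample = "-1\n0 1 -1\n2 -1\n"
print(parse_category_docs(sample))  # [[], [0, 1], [2]]
```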
Lastly, the document-term matrices are given in 'document_term_matrix.txt'.
In some data sets, there is only one 'document_term_matrix.txt' that is shared across the different folds and splits;
in the others, each fold and split has a separate 'document_term_matrix.txt'.
Each 'document_term_matrix.txt' file is composed as:
[# documents]
[# distinct terms]
[# distinct terms in the first document]
[term_id1]:[# occurrences]
[term_id2]:[# occurrences]
...
[# distinct terms in the second document]
[term_id1]:[# occurrences]
...
Note that the term ids are also 0-based.
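For readers who want to load a 'document_term_matrix.txt' file themselves, the following Python sketch follows the layout described above. The function name parse_doc_term_matrix is hypothetical and the code is not taken from our released implementation. It splits the file on arbitrary whitespace, so it should work whether the [term_id]:[# occurrences] pairs sit on separate lines or share a line.

```python
def parse_doc_term_matrix(text):
    """Parse the contents of 'document_term_matrix.txt'.

    Returns (num_terms, docs), where docs[i] maps each 0-based term id
    occurring in document i to its number of occurrences.
    """
    tokens = iter(text.split())
    num_docs = int(next(tokens))   # [# documents]
    num_terms = int(next(tokens))  # [# distinct terms] in the whole corpus
    docs = []
    for _ in range(num_docs):
        num_distinct = int(next(tokens))  # [# distinct terms in this document]
        counts = {}
        for _pair in range(num_distinct):
            term_id, occurrences = next(tokens).split(":")
            counts[int(term_id)] = int(occurrences)
        docs.append(counts)
    return num_terms, docs


# Made-up example: 2 documents over a vocabulary of 5 terms.
sample = "2\n5\n2\n0:3\n4:1\n1\n2:2\n"
num_terms, docs = parse_doc_term_matrix(sample)
print(num_terms)  # 5
print(docs)       # [{0: 3, 4: 1}, {2: 2}]
```

Storing each document as a dictionary keeps the sparse structure of the file; converting to a dense or scipy-style sparse matrix is left to the reader.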