Bayesian nonparametric models for name disambiguation and supervised learning
Dai, Andrew Mingbo
MetadataShow full item record
This thesis presents new Bayesian nonparametric models and approaches for their development, for the problems of name disambiguation and supervised learning. Bayesian nonparametric methods form an increasingly popular approach for solving problems that demand a high amount of model flexibility. However, this field is relatively new, and there are many areas that need further investigation. Previous work on Bayesian nonparametrics has neither fully explored the problems of entity disambiguation and supervised learning nor the advantages of nested hierarchical models. Entity disambiguation is a widely encountered problem where different references need to be linked to a real underlying entity. This problem is often unsupervised as there is no previously known information about the entities. Further to this, effective use of Bayesian nonparametrics offer a new approach to tackling supervised problems, which are frequently encountered. The main original contribution of this thesis is a set of new structured Dirichlet process mixture models for name disambiguation and supervised learning that can also have a wide range of applications. These models use techniques from Bayesian statistics, including hierarchical and nested Dirichlet processes, generalised linear models, Markov chain Monte Carlo methods and optimisation techniques such as BFGS. The new models have tangible advantages over existing methods in the field as shown with experiments on real-world datasets including citation databases and classification and regression datasets. I develop the unsupervised author-topic space model for author disambiguation that uses free-text to perform disambiguation unlike traditional author disambiguation approaches. The model incorporates a name variant model that is based on a nonparametric Dirichlet language model. The model handles both novel unseen name variants and can model the unknown authors of the text of the documents. Through this, the model can disambiguate authors with no prior knowledge of the number of true authors in the dataset. In addition, it can do this when the authors have identical names. I use a model for nesting Dirichlet processes named the hybrid NDP-HDP. This model allows Dirichlet processes to be clustered together and adds an additional level of structure to the hierarchical Dirichlet process. I also develop a new hierarchical extension to the hybrid NDP-HDP. I develop this model into the grouped author-topic model for the entity disambiguation task. The grouped author-topic model uses clusters to model the co-occurrence of entities in documents, which can be interpreted as research groups. Since this model does not require entities to be linked to specific words in a document, it overcomes the problems of some existing author-topic models. The model incorporates a new method for modelling name variants, so that domain-specific name variant models can be used. Lastly, I develop extensions to supervised latent Dirichlet allocation, a type of supervised topic model. The keyword-supervised LDA model predicts document responses more accurately by modelling the effect of individual words and their contexts directly. The supervised HDP model has more model flexibility by using Bayesian nonparametrics for supervised learning. These models are evaluated on a number of classification and regression problems, and the results show that they outperform existing supervised topic modelling approaches. The models can also be extended to use similar information to the previous models, incorporating additional information such as entities and document titles to improve prediction.