Closing the gap in WSD: supervised results with unsupervised methods
Word-Sense Disambiguation (WSD), holds promise for many NLP applications requiring broad-coverage language understanding, such as summarization (Barzilay and Elhadad, 1997) and question answering (Ramakrishnan et al., 2003). Recent studies have also shown that WSD can benefit machine translation (Vickrey et al., 2005) and information retrieval (Stokoe, 2005). Much work has focused on the computational treatment of sense ambiguity, primarily using data-driven methods. The most accurate WSD systems to date are supervised and rely on the availability of sense-labeled training data. This restriction poses a significant barrier to widespread use of WSD in practice, since such data is extremely expensive to acquire for new languages and domains. Unsupervised WSD holds the key to enable such application, as it does not require sense-labeled data. However, unsupervised methods fall far behind supervised ones in terms of accuracy and ease of use. In this thesis we explore the reasons for this, and present solutions to remedy this situation. We hypothesize that one of the main problems with unsupervised WSD is its lack of a standard formulation and general purpose tools common to supervised methods. As a first step, we examine existing approaches to unsupervised WSD, with the aim of detecting independent principles that can be utilized in a general framework. We investigate ways of leveraging the diversity of existing methods, using ensembles, a common tool in the supervised learning framework. This approach allows us to achieve accuracy beyond that of the individual methods, without need for extensive modification of the underlying systems. Our examination of existing unsupervised approaches highlights the importance of using the predominant sense in case of uncertainty, and the effectiveness of statistical similarity methods as a tool for WSD. However, it also serves to emphasize the need for a way to merge and combine learning elements, and the potential of a supervised-style approach to the problem. Relying on existing methods does not take full advantage of the insights gained from the supervised framework. We therefore present an unsupervised WSD system which circumvents the question of actual disambiguation method, which is the main source of discrepancy in unsupervised WSD, and deals directly with the data. Our method uses statistical and semantic similarity measures to produce labeled training data in a completely unsupervised fashion. This allows the training and use of any standard supervised classifier for the actual disambiguation. Classifiers trained with our method significantly outperform those using other methods of data generation, and represent a big step in bridging the accuracy gap between supervised and unsupervised methods. Finally, we address a major drawback of classical unsupervised systems – their reliance on a fixed sense inventory and lexical resources. This dependence represents a substantial setback for unsupervised methods in cases where such resources are unavailable. Unfortunately, these are exactly the areas in which unsupervised methods are most needed. Unsupervised sense-discrimination, which does not share those restrictions, presents a promising solution to the problem. We therefore develop an unsupervised sense discrimination system. We base our system on a well-studied probabilistic generative model, Latent Dirichlet Allocation (Blei et al., 2003), which has many of the advantages of supervised frameworks. The model’s probabilistic nature lends itself to easy combination and extension, and its generative aspect is well suited to linguistic tasks. Our model achieves state-of-the-art performance on the unsupervised sense induction task, while remaining independent of any fixed sense inventory, and thus represents a fully unsupervised, general purpose, WSD tool.