Unsupervised induction of semantic roles
In recent years, a considerable amount of work has been devoted to the task of automatic frame-semantic analysis. Given the relative maturity of syntactic parsing technology, which is an important prerequisite, frame-semantic analysis represents a realistic next step towards broad-coverage natural language understanding and has been shown to benefit a range of natural language processing applications such as information extraction and question answering. Due to the complexity which arises from variations in syntactic realization, data-driven models based on supervised learning have become the method of choice for this task. However, the reliance on large amounts of semantically labeled data which is costly to produce for every language, genre and domain, presents a major barrier to the widespread application of the supervised approach. This thesis therefore develops unsupervised machine learning methods, which automatically induce frame-semantic representations without making use of semantically labeled data. If successful, unsupervised methods would render manual data annotation unnecessary and therefore greatly benefit the applicability of automatic framesemantic analysis. We focus on the problem of semantic role induction, in which all the argument instances occurring together with a specific predicate in a corpus are grouped into clusters according to their semantic role. Our hypothesis is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences, by leveraging the syntactic relations conveyed through parse trees with lexical-semantic information. We argue that semantic role induction can be guided by three linguistic principles. The first is the well-known constraint that semantic roles are unique within a particular frame. The second is that the arguments occurring in a specific syntactic position within a specific linking all bear the same semantic role. The third principle is that the (asymptotic) distribution over argument heads is the same for two clusters which represent the same semantic role. We consider two approaches to semantic role induction based on two fundamentally different perspectives on the problem. Firstly, we develop feature-based probabilistic latent structure models which capture the statistical relationships that hold between the semantic role and other features of an argument instance. Secondly, we conceptualize role induction as the problem of partitioning a graph whose vertices represent argument instances and whose edges express similarities between these instances. The graph thus represents all the argument instances for a particular predicate occurring in the corpus. The similarities with respect to different features are represented on different edge layers and accordingly we develop algorithms for partitioning such multi-layer graphs. We empirically validate our models and the principles they are based on and show that our graph partitioning models have several advantages over the feature-based models. In a series of experiments on both English and German the graph partitioning models outperform the feature-based models and yield significantly better scores over a strong baseline which directly identifies semantic roles with syntactic positions. In sum, we demonstrate that relatively high-quality shallow semantic representations can be induced without human supervision and foreground a promising direction of future research aimed at overcoming the problem of acquiring large amounts of lexicalsemantic knowledge.