Extracting information from fiction
Information Extraction (IE) techniques have great potential to help companies leverage the valuable information embedded in unstructured textual data. Such data could be exploited to help drive sales and to enhance the customer's experience when searching or browsing for products. Extensive research has been performed in the field of IE; to date, however, no work has applied it directly to the domain of fiction. The aim of this study is to explore the ability of IE techniques to extract the central characters and their relationships from the full text of works of fiction. To begin our investigation, we present a collection of hypotheses outlining our expectations in approaching and resolving these problems. We then outline our data collection process, which resulted in a Gold Standard containing ordered lists of characters and their relationships for eight classic books. For the task of character extraction, we test two rule-based co-reference resolution models and two ordering techniques. Our best model achieves an average of 100% coverage on the three most important characters and 78.4% across all central characters, compared with baselines of 73.3% and 57.4%, respectively. For the task of relation extraction, we implement rule-based systems to detect the presence and types of relationships between characters. We achieve 73.3% coverage in detecting the top three pairs of characters involved in relationships. The results for inferring relationship types are preliminary. We provide an analysis of relationship mentions in works of fiction and propose a number of approaches for future work.
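The abstract reports coverage figures without defining the metric. As a minimal sketch, and assuming coverage means the fraction of the top-k Gold Standard entries that appear anywhere in a system's extracted list, averaged over books, it could be computed as follows. The function names, the value of k, and the character names are hypothetical and serve only as illustration; they are not taken from the study.

```python
from statistics import mean


def coverage_at_k(predicted, gold, k):
    """Fraction of the first k gold entries found in the predicted list.

    Assumption: order within `predicted` is ignored; only membership counts.
    """
    top_gold = gold[:k]
    if not top_gold:
        return 0.0
    predicted_set = set(predicted)
    hits = sum(1 for name in top_gold if name in predicted_set)
    return hits / len(top_gold)


def mean_coverage(books, k):
    """Average coverage_at_k over (predicted, gold) pairs, one pair per book."""
    return mean(coverage_at_k(pred, gold, k) for pred, gold in books)


# Toy data with placeholder names (not from the Gold Standard).
gold = ["Alice", "Bob", "Dave", "Carol"]
predicted = ["Alice", "Carol", "Bob"]

top3 = coverage_at_k(predicted, gold, 3)   # Alice and Bob found, Dave missed
all_central = coverage_at_k(predicted, gold, len(gold))
```

Under this assumed definition, a score of 100% on the top three characters would mean that every book's three highest-ranked gold characters were all extracted, regardless of the order in which the system listed them.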