Automatic caption generation for news images
This thesis is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Automatic description generation for video frames would help security authorities manage and utilize large volumes of monitoring data more efficiently. Image search engines could benefit from image descriptions by supporting more accurate and targeted queries for end users. Importantly, generating image descriptions would aid blind or partially sighted people who cannot access visual information in the same way as sighted people can. However, previous work has relied on fine-grained resources, manually created for specific domains and applications. In this thesis, we explore the feasibility of automatic caption generation for news images in a knowledge-lean way. We depart from previous work, as we learn a model of caption generation from publicly available data that has not been explicitly labelled for our task. The model consists of two components, namely extracting image content and rendering it in natural language. Specifically, we exploit data resources where images and their textual descriptions co-occur naturally. We present a new dataset consisting of news articles, images, and their captions that we acquired from the BBC News website. Rather than laboriously annotating images with keywords, we simply treat the captions as the labels. We show that it is possible to learn the visual and textual correspondence under such noisy conditions by extending an existing generative annotation model (Lavrenko et al., 2003). We also find that the accompanying news documents substantially complement the extraction of the image content.
In order to provide a better modelling and representation of image content, we propose a probabilistic image annotation model that exploits the synergy between visual and textual modalities under the assumption that images and their textual descriptions are generated by a shared set of latent variables (topics). Using Latent Dirichlet Allocation (Blei and Jordan, 2003), we represent visual and textual modalities jointly as a probability distribution over a set of topics. Our model takes these topic distributions into account while finding the most likely keywords for an image and its associated document. The availability of news documents in our dataset allows us to perform the caption generation task in a fashion akin to text summarization, with one important difference: our model is not solely based on text but uses the image to select content from the document that should be present in the caption. We propose both extractive and abstractive caption generation models to render the extracted image content in natural language without relying on rich knowledge resources, sentence templates or grammars. The backbone for both approaches is our topic-based image annotation model. Our extractive models examine how best to select sentences that overlap in content with the output of our image annotation model. We adapt an existing abstractive headline generation model to our scenario by incorporating visual information. Our own model operates over image description keywords and document phrases, taking dependency and word order constraints into account. Experimental results show that both approaches can generate human-readable captions for news images. Our phrase-based abstractive model yields captions that are as informative as those written by the BBC journalists.
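As a rough illustration of the extractive idea, the following minimal sketch picks the document sentence that shares the most words with a set of image keywords. The function names, the simple tokenizer, and the overlap count are assumptions made for illustration only; the models in this thesis derive their keywords from the topic-based annotation model and are not limited to this kind of word counting.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(sentence, keywords):
    """Count the distinct keywords that occur in the sentence.

    Hypothetical scoring function: exact word match only, no stemming.
    """
    return len(set(tokenize(sentence)) & set(keywords))


def select_caption(sentences, keywords):
    """Return the sentence sharing the most keywords with the image.

    Ties go to the earliest sentence; an empty sentence list raises ValueError.
    """
    return max(sentences, key=lambda s: overlap_score(s, keywords))


# Illustrative inputs (invented for this sketch, not taken from the dataset).
document_sentences = [
    "The prime minister met union leaders on Monday.",
    "Floods damaged homes across the north of England.",
    "Rescue teams used boats to reach flooded homes.",
]
image_keywords = ["flood", "homes", "boats", "rescue"]

caption = select_caption(document_sentences, image_keywords)
print(caption)
```

Because matching is exact, "floods" and "flooded" do not count as hits for the keyword "flood" here; a fuller treatment would need stemming or, as in the thesis, a probabilistic notion of similarity.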