Show simple item record

dc.contributor.advisor: Lapata, Mirella
dc.contributor.advisor: Lavrenko, Victor
dc.contributor.author: Feng, Yansong
dc.date.accessioned: 2011-09-07T13:24:11Z
dc.date.available: 2011-09-07T13:24:11Z
dc.date.issued: 2011-06-30
dc.identifier.uri: http://hdl.handle.net/1842/5291
dc.description.abstract: This thesis is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Automatic description generation for video frames would help security authorities manage and utilize large volumes of monitoring data more efficiently. Image search engines could potentially benefit from image descriptions in supporting more accurate and targeted queries for end users. Importantly, generating image descriptions would aid blind or partially sighted people who cannot access visual information in the same way as sighted people can. However, previous work has relied on fine-grained resources, manually created for specific domains and applications.

In this thesis, we explore the feasibility of automatic caption generation for news images in a knowledge-lean way. We depart from previous work in that we learn a model of caption generation from publicly available data that has not been explicitly labelled for our task. The model consists of two components, namely extracting image content and rendering it in natural language. Specifically, we exploit data resources where images and their textual descriptions co-occur naturally. We present a new dataset consisting of news articles, images, and their captions that we retrieved from the BBC News website. Rather than laboriously annotating images with keywords, we simply treat the captions as labels. We show that it is possible to learn the visual and textual correspondence under such noisy conditions by extending an existing generative annotation model (Lavrenko et al., 2003). We also find that the accompanying news documents substantially complement the extraction of the image content.

In order to provide better modelling and representation of image content, we propose a probabilistic image annotation model that exploits the synergy between visual and textual modalities under the assumption that images and their textual descriptions are generated by a shared set of latent variables (topics). Using Latent Dirichlet Allocation (Blei and Jordan, 2003), we represent visual and textual modalities jointly as a probability distribution over a set of topics. Our model takes these topic distributions into account while finding the most likely keywords for an image and its associated document.

The availability of news documents in our dataset allows us to perform the caption generation task in a fashion akin to text summarization, save for one important difference: our model is not solely based on text but uses the image in order to select content from the document that should be present in the caption. We propose both extractive and abstractive caption generation models to render the extracted image content in natural language without relying on rich knowledge resources, sentence templates, or grammars. The backbone for both approaches is our topic-based image annotation model. Our extractive models examine how to best select sentences that overlap in content with our image annotation model. We modify an existing abstractive headline generation model to fit our scenario by incorporating visual information. Our own model operates over image description keywords and document phrases, taking dependency and word order constraints into account. Experimental results show that both approaches can generate human-readable captions for news images. Our phrase-based abstractive model manages to yield captions that are as informative as those written by the BBC journalists.
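The joint topic representation described in the abstract can be illustrated with a small sketch. The snippet below is not the thesis implementation; it is a minimal illustration, assuming gensim's LDA as the topic model and hypothetical "vis_*" tokens standing in for quantised visual features, of how textual words and visual terms can share one topic space and how candidate keywords for an image-document pair might be ranked by P(w|d) = sum_k P(w|z_k) P(z_k|d).

# Illustrative sketch only (not the thesis code): a joint LDA topic space over
# textual words and quantised visual terms, used to rank caption keywords.
# The corpus, vocabulary, and "vis_*" tokens below are hypothetical.
from gensim import corpora, models

# Each training "document" mixes words from a news article with visual terms
# (e.g. cluster ids of local image descriptors) extracted from its image.
docs = [
    ["minister", "election", "vote", "vis_12", "vis_7", "vis_3"],
    ["storm", "coast", "flood", "vis_3", "vis_19", "vis_7"],
    # ... more image-document pairs ...
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Fit a small LDA model over the mixed textual/visual vocabulary.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10,
                      passes=20, random_state=0)

def rank_keywords(doc_tokens, candidates, topn=5):
    """Score candidate keywords by P(w|d) = sum_k P(w|z_k) P(z_k|d)."""
    bow = dictionary.doc2bow(doc_tokens)
    theta = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    phi = lda.get_topics()  # shape: (num_topics, vocab_size)
    scores = {}
    for w in candidates:
        if w in dictionary.token2id:
            wid = dictionary.token2id[w]
            scores[w] = sum(p * phi[k][wid] for k, p in theta.items())
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Rank textual keywords for a new image-document pair.
print(rank_keywords(["election", "vis_12", "vis_7"], ["minister", "flood", "vote"]))

The design choice mirrored here is that image and text are treated as two views generated by one shared set of latent topics, so keyword scores draw on both modalities rather than on visual features alone.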
dc.language.iso: en
dc.publisher: The University of Edinburgh
dc.relation.hasversion: Feng, Y. and Lapata, M. (2008). Automatic image annotation using auxiliary text information. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 272–280, Morristown, NJ, USA. Association for Computational Linguistics.
dc.relation.hasversion: Feng, Y. and Lapata, M. (2010a). How many words is a picture worth? Automatic caption generation for news images. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1239–1249, Uppsala, Sweden. Association for Computational Linguistics.
dc.relation.hasversion: Feng, Y. and Lapata, M. (2010b). Topic models for image annotation and text illustration. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 831–839, Los Angeles, California. Association for Computational Linguistics.
dc.relation.hasversion: Feng, Y. and Lapata, M. (2010c). Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99, Los Angeles, California. Association for Computational Linguistics.
dc.subject: image annotation
dc.subject: image caption generation
dc.title: Automatic caption generation for news images
dc.type: Thesis or Dissertation
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: PhD Doctor of Philosophy

