

Please use this identifier to cite or link to this item: http://hdl.handle.net/1842/799


Files in This Item:

File           Size       Format
IB_thesis.pdf  792.03 kB  Adobe PDF
Title: Conditional-Entropy Metrics for Feature Selection
Authors: Bancarz, Iain
Supervisor(s): Osborne, Miles
Issue Date: Jun-2005
Publisher: University of Edinburgh. College of Science and Engineering. School of Informatics.
Abstract: We examine the task of feature selection, a method of forming simplified descriptions of complex data for use in probabilistic classifiers. Feature selection typically requires a numerical measure, or metric, of the desirability of a given set of features. The thesis considers a number of existing metrics, with particular attention to those based on entropy and other quantities derived from information theory. A useful new perspective on feature selection is provided by the concepts of partitioning and encoding of data by a feature set. The ideas of partitioning and encoding, together with the theoretical shortcomings of existing metrics, motivate a new class of feature selection metrics based on conditional entropy. The simplest of the new metrics is referred to as expected partition entropy or EPE. The performance of the new and existing metrics is compared in experiments with a simplified form of part-of-speech tagging and with classification of Reuters news stories by topic. In order to conduct the experiments, a new class of accelerated feature selection search algorithms is introduced; a member of this class is found to provide significantly increased speed with minimal loss in performance, as measured by feature selection metrics and accuracy on test data. The comparative performance of existing metrics is also analysed, giving rise to a new general conjecture regarding the wrapper class of metrics. Each wrapper is inherently tied to a specific type of classifier. The experimental results support the idea that a wrapper selects feature sets which perform well in conjunction with its own particular classifier, but that this good performance cannot be expected to carry over to other types of model. The new metrics introduced in this thesis prove to have substantial advantages over a representative selection of other feature selection mechanisms: mutual information, frequency-based cutoff, the Koller-Sahami information loss measure, and two different types of wrapper method. Feature selection using the new metrics easily outperforms other filter-based methods such as mutual information; additionally, our approach attains comparable performance to a wrapper method at a fraction of the computational expense. Finally, members of the new class of metrics succeed in a case where the Koller-Sahami metric fails to provide a meaningful criterion for feature selection.
Description: Institute for Communicating and Collaborative Systems
URI: http://hdl.handle.net/1842/799
Appears in Collections:Informatics thesis and dissertation collection
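
Note: the abstract contrasts entropy-based filter metrics with wrapper methods. As a generic illustration of the filter idea only, the Python sketch below scores a candidate feature set F by the conditional entropy H(C | F) of the class label C given the feature values, and drives a greedy forward search with that score. This is not the thesis's expected partition entropy (EPE) metric, whose precise definition appears only in the full text; every name below is an illustrative assumption.

    # Sketch of conditional-entropy filter scoring for feature selection.
    # Not the thesis's EPE metric; a generic H(C | F) score for illustration.
    from collections import Counter
    from math import log2

    def conditional_entropy(samples, feature_set):
        """H(C | F): expected entropy of the class label within each
        partition cell induced by the values of the selected features.

        samples: list of (feature_dict, class_label) pairs
        feature_set: list of feature names to condition on
        """
        cells = Counter()       # count of samples per feature-value combination
        cell_class = Counter()  # count per (combination, class label) pair
        n = len(samples)
        for features, label in samples:
            key = tuple(features.get(f) for f in feature_set)
            cells[key] += 1
            cell_class[(key, label)] += 1

        h = 0.0
        for (key, _label), joint in cell_class.items():
            p_joint = joint / n          # P(F = key, C = label)
            p_cond = joint / cells[key]  # P(C = label | F = key)
            h -= p_joint * log2(p_cond)
        return h

    def greedy_select(samples, candidate_features, k):
        """Greedy forward search: repeatedly add the feature that most
        reduces H(C | F). Lower conditional entropy = more informative set.
        Assumes k <= len(candidate_features)."""
        selected = []
        for _ in range(k):
            best = min(candidate_features,
                       key=lambda f: conditional_entropy(samples, selected + [f]))
            selected.append(best)
            candidate_features = [f for f in candidate_features if f != best]
        return selected

    # Toy usage, e.g. for simplified part-of-speech tagging features:
    # toy = [({"prev": "the"}, "NOUN"), ({"prev": "a"}, "NOUN"),
    #        ({"prev": "to"}, "VERB")]
    # greedy_select(toy, ["prev"], 1)  # -> ["prev"]

A wrapper method would instead retrain its target classifier on every candidate feature set and score by held-out accuracy; avoiding that repeated training is the source of the computational advantage the abstract claims for filter-style metrics.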
