This thesis proposes a new approach to structured knowledge discovery from texts
that considers the mining process itself, the evaluation of the discovered knowledge
by the model, and the human assessment of the quality of the outcome.
This is achieved by integrating Natural-Language technology and Genetic Algorithms
to produce novel explanatory hypotheses. Natural-Language techniques are
specifically used to extract genre-based information from text documents. Additional
semantic and rhetorical information is also captured, both to generate training data
and to feed a semi-structured Latent Semantic Analysis process.
The discovery process is modeled by a semantically guided Genetic Algorithm
which uses training data to guide the search and optimization process. A number of
novel criteria to evaluate the quality of the new knowledge are proposed. Consequently,
new genetic operations suitable for text mining are designed, and techniques for
Evolutionary Multi-Objective Optimization are adapted so that the model can trade off
between the different criteria in the hypotheses.
Domain experts took part in an experiment to assess the quality of the hypotheses
produced by the model, so as to establish the effectiveness of those hypotheses in
terms of novel and interesting knowledge. The assessment showed encouraging results
for the discovered knowledge and for the correlation between the model's evaluation
and the human opinions.