Learning to make decisions with unforeseen possibilities
Methods for learning optimal policies often assume that the way the domain is conceptualised (the possible states and relevant actions needed to solve one’s decision problem) is known in advance and does not change during learning. This assumption is unrealistic in many scenarios. Often, new evidence reveals important information about what is possible, not just about what is likely or unlikely. A learner may have been completely unaware that such possibilities even existed prior to learning. This thesis presents a model of an agent which discovers and exploits unforeseen possibilities from two sources of evidence: domain exploration and communication with an expert. The model combines probabilistic and symbolic reasoning to estimate all components of the decision problem, including the set of belief variables, the possible actions, and the probabilistic dependencies between variables. Unlike prior work on solving decision problems by discovering and learning to exploit unforeseen possibilities (e.g., Rong (2016); McCallum and Ballard (1996)), our model supports discovering and learning to exploit unforeseen factors, rather than only additional atomic states. Becoming aware of an unforeseen factor is computationally more challenging than becoming aware of an additional atomic state: even a single boolean factor doubles the size of the decision problem’s hypothesis space, whereas a new atomic state increases it by just one. We show via experiments that one can meet those challenges by adopting (defeasible) reasoning principles that are familiar from the literature on belief revision: roughly, default to simple models over more complex ones, and default to conserving what you’ve learned from prior evidence. For one-step decision problems, our agent learns the components of a Decision Network; for sequential problems, it learns a Factored Markov Decision Process.
We prove convergence theorems for our models, given the learner’s and expert’s strategies for gathering evidence. Furthermore, our experiments show that the agent converges on optimal behaviour even when it starts out completely unaware of factors that are critical to success.
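The contrast between discovering a new factor and discovering a new atomic state can be illustrated with a small count. The sketch below is not the thesis's model: the factor names (`rain`, `umbrella`, `wind`) and the helper `joint_states` are hypothetical, and the number of joint assignments is used only as a simple proxy for the size of the hypothesis space.

```python
from itertools import product

def joint_states(factors):
    """Enumerate every joint assignment for a dict mapping factor -> values."""
    names = list(factors)
    return [dict(zip(names, values))
            for values in product(*(factors[name] for name in names))]

# Hypothetical factors the agent is initially aware of (illustration only).
known = {"rain": [False, True], "umbrella": [False, True]}
before = joint_states(known)

# Atomic view: becoming aware of one new state adds a single element.
atomic_after = len(before) + 1

# Factored view: becoming aware of one new boolean factor doubles the space,
# because every existing assignment splits into two.
aware = dict(known, wind=[False, True])
after = joint_states(aware)

print(len(before), atomic_after, len(after))
```

With two known boolean factors there are four joint assignments; one extra atomic state gives five, while one extra boolean factor gives eight.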