Programming language semantics as a foundation for Bayesian inference
MetadataShow full item record
Bayesian modelling, in which our prior belief about the distribution on model parameters is updated by observed data, is a popular approach to statistical data analysis. However, writing specific inference algorithms for Bayesian models by hand is time-consuming and requires significant machine learning expertise. Probabilistic programming promises to make Bayesian modelling easier and more accessible by letting the user express a generative model as a short computer program (with random variables), leaving inference to the generic algorithm provided by the compiler of the given language. However, it is not easy to design a probabilistic programming language correctly and define the meaning of programs expressible in it. Moreover, the inference algorithms used by probabilistic programming systems usually lack formal correctness proofs and bugs have been found in some of them, which limits the confidence one can have in the results they return. In this work, we apply ideas from the areas of programming language theory and statistics to show that probabilistic programming can be a reliable tool for Bayesian inference. The first part of this dissertation concerns the design, semantics and type system of a new, substantially enhanced version of the Tabular language. Tabular is a schema-based probabilistic language, which means that instead of writing a full program, the user only has to annotate the columns of a schema with expressions generating corresponding values. By adopting this paradigm, Tabular aims to be user-friendly, but this unusual design also makes it harder to define the syntax and semantics correctly and reason about the language. We define the syntax of a version of Tabular extended with user-defined functions and pseudo-deterministic queries, design a dependent type system for this language and endow it with a precise semantics. We also extend Tabular with a concise formula notation for hierarchical linear regressions, define the type system of this extended language and show how to reduce it to pure Tabular. In the second part of this dissertation, we present the first correctness proof for a Metropolis-Hastings sampling algorithm for a higher-order probabilistic language. We define a measure-theoretic semantics of the language by means of an operationally-defined density function on program traces (sequences of random variables) and a map from traces to program outputs. We then show that the distribution of samples returned by our algorithm (a variant of “Trace MCMC” used by the Church language) matches the program semantics in the limit.