Edinburgh Research Archive

    Learning representations for speech recognition using artificial neural networks

    Download
    Swietojanski2016.pdf (3.410Mb)
    Date
    2016-11-29
    Author
    Swietojanski, Paweł
    Abstract
    Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN).
    The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found that the proposed adaptation techniques possess many desirable properties: they are relatively low-dimensional, do not overfit and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting.
    The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources such as lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC decreases WERRs by an additional 13.6%, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of transcribed acoustic data available in the target language.
    The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamform-enhanced signals.
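The LHUC idea described in the abstract (rescaling the outputs of hidden units in a fixed, already-trained network with coefficients learned per speaker or environment) can be illustrated with a minimal sketch. This is not the thesis implementation. The toy weights, the function names `hidden_layer` and `lhuc_scale`, and the `2 * sigmoid(r)` re-parameterisation of the amplitudes are all assumptions made for illustration; the abstract itself does not specify how the coefficients are parameterised.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(x, weights, biases):
    """Speaker-independent hidden layer: sigmoid(W x + b). Kept fixed during adaptation."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def lhuc_scale(hidden, r):
    """Rescale each hidden unit by a speaker-dependent amplitude.

    Assumption: amplitude = 2 * sigmoid(r), which lies in (0, 2) and
    equals 1 when r = 0, so r = 0 leaves the original model unchanged.
    """
    return [2.0 * sigmoid(ri) * h for ri, h in zip(r, hidden)]


# Hypothetical toy model: 2 inputs, 3 hidden units.
W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b = [0.0, 0.1, -0.1]
x = [1.0, 2.0]

h = hidden_layer(x, W, b)

r_init = [0.0, 0.0, 0.0]      # before adaptation: every amplitude is 1
r_speaker = [1.0, -1.0, 0.0]  # made-up per-speaker parameters

h_unadapted = lhuc_scale(h, r_init)
h_adapted = lhuc_scale(h, r_speaker)
```

In an actual adaptation pass only the vector `r` would be updated on the adaptation data, with `W` and `b` held fixed. That gives one parameter per hidden unit, which is consistent with the abstract's description of the transform as relatively low-dimensional.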
    URI
    http://hdl.handle.net/1842/22835
    Collections
    • Informatics thesis and dissertation collection

    Privacy & Cookies | Takedown Policy | Accessibility | Contact