Master-R2 Subjects 2016-2017


Subject 1

Title: Data selection for the training of deep neural networks in the framework of automatic speech recognition

Supervisor: Irina Illina

Team and lab: MultiSpeech

Contact: illina@loria.fr

Co-supervisor: Dominique Fohr

Team and lab: MultiSpeech

Contact: dominique.fohr@loria.fr

 

Motivations and context

More and more audio and video content appears on the Internet each day. About 300 hours of multimedia are uploaded per minute. Nobody is able to go through this quantity of data. In these multimedia sources, audio data represents a very important part. The classical approach to spoken content retrieval from audio documents is automatic speech recognition followed by text retrieval. In this internship, we will focus on the speech recognition system.

One of the important modules of an automatic speech recognition system is the acoustic model: it models the sounds of speech, mainly phonemes. Currently, the best performing models are based on deep neural networks. These models are trained on a very large amount of audio data, because the models contain millions of parameters to estimate. For training acoustic models, it is necessary to have audio documents for which the exact text transcript is available (supervised training).

The Multi-Genre Broadcast (MGB) Challenge is an evaluation of speech recognition systems, using TV recordings in English or Arabic. The speech data is broad and multi-genre, spanning the whole range of TV output, and represents a challenging task for speech technology. The speech data covers the multiple genres in broadcast TV, categorized in terms of 8 genres: advice, children’s, comedy, competition, documentary, drama, events and news.

The problem with the MGB Challenge data is that the exact transcription of the audio documents is not available. Only the subtitles of the TV recordings are given. These subtitles are sometimes far from what is actually pronounced: some words may be omitted, hesitations are rarely transcribed and some sentences are reformulated.

In this internship, we will focus on the problem of data selection for efficient acoustic model training.

Objectives

A subtitle is composed of a text, a start time of appearance on the screen (timecode) and an end time of appearance. These start and end times are given relative to the beginning of the program. It is therefore easy to associate each subtitle with the corresponding audio segment.
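The association between a subtitle and its audio segment can be sketched as follows. This is only an illustration: the `Subtitle` class, the `sample_range` function and the 16 kHz sampling rate are assumptions, not part of the MGB data format.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    """One subtitle entry (hypothetical structure, for illustration)."""
    text: str
    start_s: float  # start time in seconds from the beginning of the program
    end_s: float    # end time in seconds from the beginning of the program


def sample_range(sub, sample_rate=16000):
    """Return (first, last_exclusive) sample indices of the audio segment.

    The sampling rate of 16 kHz is an assumption; use the rate of the corpus.
    """
    first = int(round(sub.start_s * sample_rate))
    last = int(round(sub.end_s * sample_rate))
    return first, last


sub = Subtitle("good evening", 12.5, 14.0)
print(sample_range(sub))  # (200000, 224000)
```

The audio samples of the segment are then `audio[first:last]` for an array `audio` holding the whole program.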

We have at our disposal a very large audio corpus with the corresponding subtitles, and we want to develop data selection methods for obtaining high-performance acoustic models, that is to say, models that yield a word error rate as small as possible. If we use all the training data, the errors in the subtitles will lead to poor-quality acoustic models and therefore to a high word error rate.
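For reference, the word error rate is usually computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (the function name `word_error_rate` is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# One reference word out of six is missing from the hypothesis.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```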

We propose to use a deep neural network (DNN) to classify the segments into two categories: audio segments that correspond to their subtitles and audio segments that do not. The student will analyze what information, acoustic and/or linguistic, is relevant to this selection task and can be used as input to the DNN.

The student will validate the proposed approaches using the automatic transcription system for TV broadcasts developed in our team.

Required skills: background in statistics, natural language processing and computer programming skills (Perl, Python).

Location and contacts: Loria laboratory, Speech team, Nancy, France

irina.illina@loria.fr dominique.fohr@loria.fr

Candidates should email a detailed CV together with their diploma.

===========================================================================

Subject 2

Title: Domain adaptation of neural network language model for speech recognition

Supervisor: Irina Illina

Team and lab: MultiSpeech

Contact: illina@loria.fr

Co-supervisor: Dominique Fohr

Team and lab: MultiSpeech

Contact: dominique.fohr@loria.fr

Motivation and Context

Language models (LMs) play a key role in modern automatic speech recognition systems and ensure that the output respects the patterns of the language. In state-of-the-art systems, the language model is a combination of n-gram LMs and neural network LMs because they are complementary. These LMs are trained on huge text corpora.

The language models are trained on a corpus of varied texts, which provides average performance on all types of data. However, document content is generally heavily influenced by the domain, which can include topic, genre (documentary, news, etc.) and speaking style. It has been shown that adapting LMs to small amounts of matched in-domain text data provides significant improvements in both perplexity and word error rate. The objective of the internship is to adapt a neural-network-based language model to the domain of the audio document to be recognized. For this, we will use a small domain-specific text corpus.
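Perplexity, mentioned above, is the exponential of the average negative log-probability that a language model assigns to the words of a test text; lower is better. A minimal sketch, assuming the per-word probabilities have already been obtained from some language model (the function name and the toy values are illustrative):

```python
import math


def perplexity(word_probs):
    """Perplexity from the probabilities assigned to each word of a text."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Toy values: a model that gives every word probability 1/4 behaves like a
# uniform choice among 4 words, so its perplexity is close to 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```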

The Multi-Genre Broadcast (MGB) Challenge is an evaluation campaign of speech recognition systems, using TV recordings in English or Arabic. The speech data is broad and multi-genre, spanning the whole range of TV output, and represents a challenging task for speech technology. Speech data covers the multiple genres in broadcast TV, categorized in terms of 8 genres: advice, children’s, comedy, competition, documentary, drama, events and news.

During the internship, the student will develop LM adaptation methods in the context of the MGB data.

Goals and Objectives

Neural network LM adaptation can be categorized as either feature-based or model-based.

In feature-based adaptation, the input of the neural network is augmented with auxiliary features, which model domain, topic information, etc. However, these auxiliary features must be learned during the training of the LM and thus require retraining the whole model.
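The idea of augmenting the network input can be sketched as follows. This is an illustration only: using a one-hot genre vector as the auxiliary feature is an assumption (the genre list is the one of the MGB data), and `augmented_input` is a hypothetical helper, not part of any toolkit.

```python
GENRES = ["advice", "children's", "comedy", "competition",
          "documentary", "drama", "events", "news"]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def augmented_input(word_id, vocab_size, genre):
    """Concatenate the one-hot word vector with a one-hot genre vector."""
    word_vec = one_hot(word_id, vocab_size)
    genre_vec = one_hot(GENRES.index(genre), len(GENRES))
    return word_vec + genre_vec


x = augmented_input(word_id=2, vocab_size=5, genre="news")
print(len(x))  # 13
```

Because the network weights connected to the auxiliary part of the input are learned jointly with the rest of the model, changing the auxiliary features means retraining the model.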

Model-based adaptation consists of adding complementary layers and training these layers with domain-specific adaptation data. An advantage of this method is that full retraining is not necessary. Another model-based adaptation method is fine-tuning: after training the model with the whole training data, the model is tuned with the target domain data. The downside of this approach is the lack of an optimization objective.
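The principle of training only an added component on in-domain data, while the original model stays frozen, can be illustrated on a deliberately tiny example. Everything below is an assumption made for illustration: the "base model" is reduced to fixed scores over a 3-word vocabulary, the added "layer" is a simple additive bias, and all numeric values are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy of `probs` with respect to the target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def train_adaptation_bias(base_logits, target, steps=500, lr=0.5):
    """Learn an additive bias by gradient descent; base_logits stay frozen."""
    bias = [0.0] * len(base_logits)
    for _ in range(steps):
        probs = softmax([b + a for b, a in zip(base_logits, bias)])
        # The gradient of the cross-entropy w.r.t. the bias is (probs - target).
        bias = [a - lr * (p - t) for a, p, t in zip(bias, probs, target)]
    return bias


base_logits = [2.0, 1.0, 0.0]  # frozen general-domain scores (toy values)
in_domain = [0.1, 0.2, 0.7]    # in-domain word distribution (toy values)

bias = train_adaptation_bias(base_logits, in_domain)
before = cross_entropy(softmax(base_logits), in_domain)
after = cross_entropy(
    softmax([b + a for b, a in zip(base_logits, bias)]), in_domain)
print(before > after)  # True
```

Only the bias is updated, which is why no retraining of the base model is needed in this kind of scheme.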

During the internship, the student will perform a bibliographic study on model adaptation approaches. Depending on the pros and cons of these approaches, we will propose a method specific to MGB data. This method may include changing the architecture of the neural network.

The student will validate the proposed approaches using the automatic transcription system for radio broadcasts developed in our team.

Required skills: background in statistics, natural language processing and computer programming skills (Perl, Python).

Location and contacts: Loria laboratory, Speech team, Nancy, France

irina.illina@loria.fr dominique.fohr@loria.fr

Candidates should email a detailed CV together with their diploma.

===========================================================================

Subject 3

Title: Using Wikipedia to search for proper names relevant to the audio document transcription

Supervisor: Irina Illina

Team and lab: MultiSpeech

Contact: illina@loria.fr

Co-supervisor: Dominique Fohr

Team and lab: MultiSpeech

Contact: dominique.fohr@loria.fr

Motivation and Context

More and more audio and video content appears on the Internet each day. About 300 hours of multimedia are uploaded per minute. Nobody is able to go through this quantity of data. In these multimedia sources, audio data represents a very important part. The classical approach to spoken content retrieval from audio documents is automatic speech recognition followed by text retrieval.

An automatic speech recognition system uses a lexicon containing the most frequent words of the language, and only the words of the lexicon can be recognized by the system. New proper names (PNs) appear constantly, requiring dynamic updates of the lexicons used by the speech recognition system. These PNs evolve over time and no vocabulary will ever contain all existing PNs. These missing proper names can be very important for the understanding of the test document.

In this study, we will focus on the problem of proper names in automatic speech recognition systems. The problem is to find relevant proper names for the audio document we want to transcribe. For this task, we will use a remarkable source of information: Wikipedia, the free online encyclopedia and the largest and most popular reference work on the Internet. Wikipedia contains a huge number of proper names.

Goals and Objectives

We assume that the audio document to be transcribed contains missing proper names, i.e. proper names that are pronounced in the audio document but that are not in the lexicon of the automatic speech recognition system; these proper names cannot be recognized (out-of-vocabulary proper names, OOV PNs).
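The notion of an OOV proper name can be made concrete with a small sketch: given a list of candidate proper names and the recognizer lexicon, the OOV PNs are the candidates absent from the lexicon. The function name, the case-insensitive comparison and the example names are assumptions made for illustration; multi-word names are not handled.

```python
def oov_proper_names(candidate_names, lexicon):
    """Return the candidates absent from the lexicon, keeping input order."""
    known = {word.lower() for word in lexicon}  # case-insensitive lookup
    return [name for name in candidate_names if name.lower() not in known]


# Toy lexicon and hypothetical candidate names.
lexicon = ["the", "president", "paris", "obama"]
candidates = ["Obama", "Macron", "Paris", "Zuckerberg"]
print(oov_proper_names(candidates, lexicon))  # ['Macron', 'Zuckerberg']
```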

The goal of this internship is to find a list of relevant OOV PNs that correspond to an audio document. We will use Wikipedia as a source of potential proper names.

Assuming that we have an approximate transcription of the audio document and a Wikipedia dump, two main points will be addressed:

  • How to represent a Wikipedia page and in which space? One possibility is to use word embeddings (for instance Mikolov’s word2vec).
  • Using the previous representation, how to select relevant pages according to the approximate transcription of the audio document? Automatic speech recognition systems for broadcast news have a 10-20% word error rate. If we project documents into a continuous space, different distances can be studied.
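One possible instance of these two points can be sketched as follows, assuming that a text is represented by the average of its word embeddings and that pages are ranked by cosine similarity to the transcription. The 3-dimensional embedding values, the page texts and the function names are hypothetical; in practice the vectors would come from a trained model such as word2vec, and other representations and distances can be studied.

```python
import math

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "election": [0.9, 0.1, 0.0],
    "president": [0.8, 0.2, 0.1],
    "vote": [0.85, 0.15, 0.05],
    "football": [0.1, 0.9, 0.2],
    "goal": [0.0, 0.8, 0.3],
    "match": [0.1, 0.85, 0.1],
}


def embed(text):
    """Average the embeddings of the known words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_pages(transcript, pages):
    """Return page titles sorted from most to least similar to the transcript."""
    query = embed(transcript)
    scored = []
    for title, text in pages.items():
        vec = embed(text)
        if query is not None and vec is not None:
            scored.append((cosine(query, vec), title))
    return [title for _, title in sorted(scored, reverse=True)]


pages = {
    "Politics page": "president election vote",
    "Sport page": "football match goal",
}
print(rank_pages("the president won the vote", pages))
```

The proper names found in the top-ranked pages would then be candidates for addition to the lexicon.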

In a first step, we can consider a Wikipedia page as a simple text page. In a second step, the student should use the structure of the Wikipedia pages (page links, tables, headings, infoboxes).

During the internship, the student will investigate methodologies based on deep neural networks.

The student will validate the proposed approaches using the automatic transcription system for radio broadcasts developed in our team.

Required skills: background in statistics, natural language processing and computer programming skills (Perl, Python).

Location and contacts: Loria laboratory, Speech team, Nancy, France

irina.illina@loria.fr dominique.fohr@loria.fr

Candidates should email a detailed CV together with their diploma.

===========================================================================