Speech Recognition and the System.Speech namespace

December 08, 2008

I had a problem recently where I wanted to find out, very roughly, the topic of a set of un-annotated lectures. I needed a recognition engine that was speaker-independent and required no training, as I don’t personally know the lecturer who gave them.

In the .NET Framework 3.5 you get access to the System.Speech namespace, which includes classes for both recognition (System.Speech.Recognition) and synthesis (System.Speech.Synthesis). It is a nice, clean wrapper around the recognition and synthesis engines built into XP and Vista.

The exact class I used for recognition was SpeechRecognitionEngine. My form simply provides two buttons to start and stop recognition, and a text box that is updated with the results of each recognition.

In my Form1 constructor I create a new SpeechRecognitionEngine and set its input to the default audio device. This uses whatever you have selected as the recording device in the volume control panel as the input to the recogniser.

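    using System.Speech.Recognition;   // needs a reference to System.Speech.dll
    using System.Threading;

    public SpeechRecognitionEngine recog;
    public Grammar gram;

    public Form1()
    {
        InitializeComponent();
        recog = new SpeechRecognitionEngine();
        // Set the input on a thread-pool thread rather than directly in the constructor.
        ThreadPool.QueueUserWorkItem(new WaitCallback(InitSpeechRecogniser));
    }

    public void InitSpeechRecogniser(object o)
    {
        // Use whichever recording device is selected in the volume control panel.
        recog.SetInputToDefaultAudioDevice();
    }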

You might be wondering why I need to queue the InitSpeechRecogniser method as a work item on the thread pool. Very strange, but this gets round a quirky exception that you’ll otherwise get on Windows XP machines only (which is what my dev machine is), described at http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=1139503&SiteID=1.

When you click the start recognition button, the recognition engine begins listening. I wanted to test the accuracy of a default grammar, if one was available, and DictationGrammar serves this purpose. The following start button handler loads the grammar into the recognition engine and hooks up the SpeechRecognized event:

    private void buttonStart_Click(object sender, EventArgs e)
    {
        // Free-text dictation grammar supplied by the framework.
        gram = new DictationGrammar();
        recog.LoadGrammar(gram);

        recog.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recog_SpeechRecognized);
        // Multiple: keep recognising until explicitly stopped.
        recog.RecognizeAsync(RecognizeMode.Multiple);
    }

I start the recogniser asynchronously in RecognizeMode.Multiple mode, so it doesn’t stop after the first recognition (I want it to terminate when I say so!).

Having set my recording device to Wave Out, so that whatever I played in WMP was the subject of recognition, I began getting some SpeechRecognized events. I handled this event as follows:
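The stop side is not shown in this post. As a minimal sketch of what "terminate when I say so" could look like, here is the same start/stop flow stripped of the form so that it runs as a console program. The class name StopSketch and the use of the Enter key as the stop signal are my own choices for illustration; in the form, the call to RecognizeAsyncStop would sit in the stop button's click handler instead.

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical stand-alone sketch; not the form code from this post.
class StopSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine recog = new SpeechRecognitionEngine())
        {
            recog.SetInputToDefaultAudioDevice();
            recog.LoadGrammar(new DictationGrammar());

            // Print each recognised phrase as it arrives.
            recog.SpeechRecognized += (sender, e) => Console.WriteLine(e.Result.Text);

            // Multiple: keep recognising until told to stop.
            recog.RecognizeAsync(RecognizeMode.Multiple);

            Console.WriteLine("Listening; press Enter to stop.");
            Console.ReadLine();

            // Stops once the recognition currently in progress has finished.
            // RecognizeAsyncCancel() would stop immediately instead.
            recog.RecognizeAsyncStop();
        }
    }
}
```

RecognizeAsyncStop lets the phrase in progress finish, which seems the better fit when you care about the last result; RecognizeAsyncCancel throws it away.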

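    // Delegate type used to pass the recognised text to the UI thread.
    // (Its declaration is not shown in the original listing; this form is assumed.)
    public delegate void UpdateTextDelegate(string text);

    void recog_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        textBox1.Invoke(new UpdateTextDelegate(UpdateTextBox), e.Result.Text);
    }

I use a delegate here because the recognition event is raised on a different thread to the UI thread, and you get a cross-thread exception if you touch the text box without calling across to the UI thread via the Invoke method. The UpdateTextBox method is as follows: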

    public void UpdateTextBox(string text)
    {
        textBox1.Text += text;
    }

This simply appends the text from the recognition event (e.Result.Text) to the text box holding the results of previous recognitions.

As it turns out, the recognition was completely unsatisfactory, so my next plan is to limit the available grammar to a set of keywords I’m looking for, which may give better results…
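As a rough sketch of that next step, and not something I have tested against the lectures yet, a keyword-only grammar can be built from the Choices and GrammarBuilder classes in System.Speech.Recognition. The keyword list, the KeywordGrammarSketch and BuildKeywordGrammar names, and the 0.7 confidence cut-off below are all placeholders of mine, and the example is written as a console program rather than as part of the form.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition;

// Hypothetical sketch: restrict recognition to a fixed set of keywords.
public static class KeywordGrammarSketch
{
    // Placeholder keywords; substitute the topics you are looking for.
    static readonly string[] Keywords = { "database", "network", "compiler", "security" };

    // Assumed cut-off; results below this confidence are ignored.
    const float MinConfidence = 0.7f;

    public static Grammar BuildKeywordGrammar(string[] keywords, CultureInfo culture)
    {
        // Choices matches exactly one of the alternatives it is given.
        GrammarBuilder builder = new GrammarBuilder(new Choices(keywords));
        // The grammar's culture has to match the recogniser's.
        builder.Culture = culture;

        Grammar grammar = new Grammar(builder);
        grammar.Name = "keywords";
        return grammar;
    }

    static void Main()
    {
        using (SpeechRecognitionEngine recog = new SpeechRecognitionEngine())
        {
            recog.SetInputToDefaultAudioDevice();
            recog.LoadGrammar(BuildKeywordGrammar(Keywords, recog.RecognizerInfo.Culture));

            recog.SpeechRecognized += (sender, e) =>
            {
                // Only report keywords the engine is reasonably sure about.
                if (e.Result.Confidence >= MinConfidence)
                    Console.WriteLine("{0} ({1:F2})", e.Result.Text, e.Result.Confidence);
            };

            recog.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening for keywords; press Enter to stop.");
            Console.ReadLine();
            recog.RecognizeAsyncCancel();
        }
    }
}
```

With only keywords in the grammar, the engine has nothing else to map ordinary speech onto, so I would expect false hits; the Confidence check is there to filter some of them out, and the threshold would need tuning against real audio.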