Reactive Extensions and Project Oxford for Cortana-like Speech Recognition Feedback

Project Oxford is a collection of APIs and SDKs from Microsoft that includes tools for transforming speech to text and text to speech. Modern applications that leverage speech-to-text often display partial recognition results to give the user immediate feedback and reduce the overall perceived latency, as Cortana does below.

[Image: Cortana screenshot]
Partial response example for “Cortana, is it going to rain today?”.

Speech Recognition SDK Overview

The Windows SDK for speech recognition in Project Oxford (which can be downloaded at https://www.projectoxford.ai/SDK) includes the ability to capture and display partial results. The API uses C# events to notify the client of everything from partial results to recognition errors to the final recognized result. The SDK can be used either to capture microphone input directly or to accept an audio stream in chunks, as shown below.

                // Capture microphone input directly
                client.AudioStart();
                // ... the user speaks; recognition events fire as audio arrives ...
                client.AudioStop();

                // Push an audio stream in chunks ("audio.wav" is a placeholder;
                // any readable Stream of audio data works)
                using (var stream = File.OpenRead("audio.wav"))
                {
                    var buffer = new byte[1024];
                    int count;
                    while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        client.SendAudio(buffer, count);
                    }
                }
                client.EndAudio();

In either case, the handler logic is the same. Here is a simple example that handles the events and prints them to a console window.

                client.OnConversationError += (sender, args) =>
                {
                    Console.WriteLine("Error {0}, {1}", args.SpeechErrorCode, args.SpeechErrorText);
                };

                client.OnPartialResponseReceived += (sender, args) =>
                {
                    Console.WriteLine("Received partial response: {0}", args.PartialResult);
                };

                client.OnResponseReceived += (sender, args) =>
                {
                    switch (args.PhraseResponse.RecognitionStatus)
                    {
                        case RecognitionStatus.Intermediate:
                            Console.WriteLine("Received intermediate response: {0}", args.PhraseResponse.Results.First().DisplayText);
                            break;
                        case RecognitionStatus.RecognitionSuccess:
                            Console.WriteLine("Received success response: {0}", args.PhraseResponse.Results.First().DisplayText);
                            break;
                        case RecognitionStatus.NoMatch:
                        case RecognitionStatus.None:
                        case RecognitionStatus.InitialSilenceTimeout:
                        case RecognitionStatus.BabbleTimeout:
                        case RecognitionStatus.HotWordMaximumTime:
                        case RecognitionStatus.Cancelled:
                        case RecognitionStatus.RecognitionError:
                        case RecognitionStatus.DictationEndSilenceTimeout:
                        case RecognitionStatus.EndOfDictation:
                        default:
                            Console.WriteLine("Received {0} response.", args.PhraseResponse.RecognitionStatus);
                            break;
                    }
                };

The SDK supports two speech recognition modes: short phrase and long dictation. The former is designed for single-shot utterances such as commands or queries, while the latter is intended for capturing longer sessions, such as email or text message dictation. Here is a summary of the kinds of events and status codes I was able to produce “in the wild” (i.e., by babbling at my laptop):

| Response Type | Short Phrase | Long Dictation |
| --- | --- | --- |
| OnPartialResponseReceived | Y | Y |
| OnConversationError | Y | Y |
| OnResponseReceived, by RecognitionStatus: | | |
| None (0) | N | N |
| Intermediate (100) | N | N |
| RecognitionSuccess (200) | Y | Y |
| Cancelled (201) | N | N |
| NoMatch (301) | Y | Y |
| InitialSilenceTimeout (303) | Y | Y |
| BabbleTimeout (304) | N | N |
| HotWordMaximumTime (305) | N | N |
| RecognitionError (500) | N | N |
| DictationEndSilenceTimeout (610) | N | Y |
| EndOfDictation (612) | N | Y |

The long dictation mode typically consists of a series of partial responses terminated by a regular response. For example, if the user spoke “Four score and seven years ago… our fathers brought forth on this continent, a new nation…”, the event handling logic above would produce something similar to the following output:

Received partial response: four
Received partial response: four score and
Received partial response: four score and seven years ago
Received success response: Four score and seven years ago.
Received partial response: our fathers brought
Received partial response: our fathers brought forth on this continent
Received partial response: our fathers brought forth on this continent a new nation
Received success response: Our fathers brought forth on this continent a new nation.

The short phrase mode is similar, except that it returns only a single response, so any utterances made after the first response are ignored.
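As a side note, the recognition mode is chosen when the client is created. Here is a minimal sketch; the factory and enum names below are from the SDK version I used and may differ in yours, and “YOUR_SUBSCRIPTION_KEY” is a placeholder:

            // A sketch of creating a client in each mode (names from the SDK
            // version I used; "YOUR_SUBSCRIPTION_KEY" is a placeholder).
            var shortPhraseClient = SpeechRecognitionServiceFactory.CreateDataClient(
                SpeechRecognitionMode.ShortPhrase, "en-US", "YOUR_SUBSCRIPTION_KEY");

            var longDictationClient = SpeechRecognitionServiceFactory.CreateDataClient(
                SpeechRecognitionMode.LongDictation, "en-US", "YOUR_SUBSCRIPTION_KEY");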

Speech Recognition With Reactive Extensions

The trouble with an event-driven approach to speech recognition handling is that by decoupling the events, you also lose some of the semantics of their sequencing. That is to say, event handlers are registered by event type, rather than by event order. So, if you wanted to introduce logic with special handling based on the sequence of recognition results, you would need some kind of shared state accessible to each of the event handlers.

Consider, for example, a long dictation mode scenario where the first pause corresponded to the title of a dictated blog post. The user might say, “A Blog Post About My Cat [pause] My cat is the greatest cat because it has orange fur. [pause] She is also afraid of vacuum cleaners and loves laser pointers.” Here is some sample code that implements this with partial feedback on both the title and the sentences:

            // Shared state: the number of completed utterances
            // (0 means the title has not been recognized yet)
            var count = 0;

            client.OnConversationError += (sender, args) =>
            {
                Console.Error.WriteLine("Failed with code '{0}' and text '{1}'.", args.SpeechErrorCode, args.SpeechErrorText);
            };

            client.OnPartialResponseReceived += (sender, args) =>
            {
                Console.CursorLeft = 0;
                var prefix = (count == 0) ? "Title" : "Sentence " + count;
                Console.Write("{0}: {1}", prefix, args.PartialResult);
            };

            client.OnResponseReceived += (sender, args) =>
            {
                if (args.PhraseResponse.RecognitionStatus == RecognitionStatus.RecognitionSuccess)
                {
                    var result = args.PhraseResponse.Results.First().DisplayText;
                    Console.CursorLeft = 0;
                    var prefix = (count == 0) ? "Title" : "Sentence " + count;
                    Console.WriteLine("{0}: {1}", prefix, result);
                    count++;
                }
            };

Notice that in order to implement this scenario, we introduced the shared state, `count`, and branched on it in both handlers.

However, another option for modeling these sequences of partial results is to use the observable abstraction from the Reactive Extensions (Rx) framework. Specifically, each partial or final response would be modeled as an `OnNext` event, and the sequence would be terminated with an `OnCompleted` event after the final response. In the case of long dictation mode, the series of partial-followed-by-regular responses would be modeled as an observable of observables, or `IObservable<IObservable<RecognizedPhrase>>`.

So, for the blog post dictation example above, here’s some logic using Rx:

            var sentenceSubscriptions = client.GetResponseObservable()
                .Select((observable, count) => new { observable, count })
                .Subscribe(
                    x => x.observable.Subscribe(
                        phrases =>
                        {
                            Console.CursorLeft = 0;
                            var firstPhrase = phrases.First();
                            var prefix = x.count == 0 ? "Title" : "Sentence " + x.count;
                            Console.Write("{0}: {1}", prefix, firstPhrase.DisplayText ?? firstPhrase.LexicalForm);
                        },
                        ex => Console.Error.WriteLine(ex),
                        () => Console.WriteLine()));

For those very familiar with Rx: all the logic to dispose subscriptions is left out of this example (sorry!), in the same way that the logic to “subtract” the event handlers was left out of the previous example.
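If you do want proper cleanup, here is a minimal sketch, assuming the same `GetResponseObservable` extension method: collect the outer and inner subscriptions into a `CompositeDisposable` (from `System.Reactive.Disposables`) and dispose it once when finished.

            // Gather every subscription, outer and inner, into one container.
            var subscriptions = new CompositeDisposable();

            subscriptions.Add(client.GetResponseObservable()
                .Select((observable, count) => new { observable, count })
                .Subscribe(x => subscriptions.Add(x.observable.Subscribe(
                    phrases => { /* render partial/final results as above */ },
                    ex => Console.Error.WriteLine(ex),
                    () => Console.WriteLine()))));

            // ... run recognition ...
            Console.ReadLine();

            // One call unhooks the outer and all inner subscriptions.
            subscriptions.Dispose();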

Beyond introducing more explicit semantics for the event sequences in the Project Oxford speech recognition APIs, using Reactive Extensions here lets you compose the results with LINQ-style operators, and it also takes care of removing the event handlers from the client once you are no longer using them (assuming you dispose your subscriptions!).
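For example, here is a hedged sketch that ignores the partial results entirely, flattening each per-sentence observable down to its final phrase set with standard operators:

            // A sketch: keep only the last (final) result of each sentence.
            var finalResults = client.GetResponseObservable()
                .SelectMany(sentence => sentence.LastAsync());

            using (finalResults.Subscribe(
                phrases => Console.WriteLine(phrases.First().DisplayText)))
            {
                Console.ReadLine();
            }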

Implementing the Speech Recognition Observable

The last example uses an extension method on the Project Oxford client. Conceptually, its signature could be as simple as:

        public static IObservable<IObservable<RecognizedPhrase>> GetResponseObservable(this DataRecognitionClient client);

In reality, however, Project Oxford returns a set of candidate phrases for what each utterance may have been, so the actual signature (the one the `phrases.First()` call above relies on) looks like:

        public static IObservable<IObservable<IEnumerable<RecognizedPhrase>>> GetResponseObservable(this DataRecognitionClient client);

The implementation is rather simple. Using the latest bits from Rx.NET, it is little more than a combination of the `FromEventPattern`, `Merge`, and `Window` operators:

        public static IObservable<IObservable<IEnumerable<RecognizedPhrase>>> GetResponseObservable(this DataRecognitionClient client)
        {
            var errorObservable = Observable.FromEventPattern<SpeechErrorEventArgs>(
                    h => client.OnConversationError += h,
                    h => client.OnConversationError -= h)
                .Select<EventPattern<SpeechErrorEventArgs>, IEnumerable<RecognizedPhrase>>(
                    x => { throw new SpeechRecognitionException(x.EventArgs.SpeechErrorCode, x.EventArgs.SpeechErrorText); });

            var partialObservable = Observable.FromEventPattern<PartialSpeechResponseEventArgs>(
                    h => client.OnPartialResponseReceived += h,
                    h => client.OnPartialResponseReceived -= h)
                .Select(x => Enumerable.Repeat(RecognizedPhrase.CreatePartial(x.EventArgs.PartialResult), 1));

            var responseObservable = Observable.FromEventPattern<SpeechResponseEventArgs>(
                    h => client.OnResponseReceived += h,
                    h => client.OnResponseReceived -= h)
                .Select(x =>
                {
                    var response = x.EventArgs.PhraseResponse;
                    switch (response.RecognitionStatus)
                    {
                        case RecognitionStatus.Intermediate:
                            return response.Results.Select(p => RecognizedPhrase.CreateIntermediate(p));
                        case RecognitionStatus.RecognitionSuccess:
                            return response.Results.Select(p => RecognizedPhrase.CreateSuccess(p));
                        case RecognitionStatus.InitialSilenceTimeout:
                            throw new InitialSilenceTimeoutException();
                        case RecognitionStatus.BabbleTimeout:
                            throw new BabbleTimeoutException();
                        case RecognitionStatus.Cancelled:
                            throw new OperationCanceledException();
                        case RecognitionStatus.DictationEndSilenceTimeout:
                            throw new DictationEndTimeoutException();
                        case RecognitionStatus.EndOfDictation:
                        case RecognitionStatus.HotWordMaximumTime:
                        case RecognitionStatus.NoMatch:
                        case RecognitionStatus.None:
                        case RecognitionStatus.RecognitionError:
                        default:
                            throw new SpeechRecognitionException();
                    }
                });

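            // Merge the error, partial, and final-response streams into one
            // sequence, then slice it with Window into one inner observable per
            // utterance, closing each window when a final response arrives.
            // Publish shares a single subscription to responseObservable between
            // the Merge input and the window-closing selector.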
            return responseObservable.Publish(observable =>
                Observable.Merge(errorObservable, partialObservable, observable)
                    .Window(() => observable));
        }

In addition to the core logic above, a few data models were introduced, including the exception types for errors and timeouts, as well as a replacement class for `RecognizedPhrase` that can represent both success responses and partial responses. For the full implementation, check out my GitHub repository, RxToProjectOxford.
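To give a flavor of those models, here is a hedged sketch of what the replacement `RecognizedPhrase` might look like. The class in the repository may differ, and the `Oxford` namespace alias for the SDK's own `RecognizedPhrase` type is an assumption made here to avoid the name collision:

        // A sketch only; the real model lives in the RxToProjectOxford repo.
        // Assumes an alias to disambiguate the SDK's result type, e.g.:
        // using Oxford = Microsoft.ProjectOxford.SpeechRecognition;
        public class RecognizedPhrase
        {
            public string DisplayText { get; private set; }
            public string LexicalForm { get; private set; }
            public bool IsFinal { get; private set; }

            // Partial responses carry only raw text.
            public static RecognizedPhrase CreatePartial(string text)
            {
                return new RecognizedPhrase { DisplayText = text, LexicalForm = text };
            }

            public static RecognizedPhrase CreateIntermediate(Oxford.RecognizedPhrase phrase)
            {
                return new RecognizedPhrase
                {
                    DisplayText = phrase.DisplayText,
                    LexicalForm = phrase.LexicalForm
                };
            }

            public static RecognizedPhrase CreateSuccess(Oxford.RecognizedPhrase phrase)
            {
                return new RecognizedPhrase
                {
                    DisplayText = phrase.DisplayText,
                    LexicalForm = phrase.LexicalForm,
                    IsFinal = true
                };
            }
        }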
