Speech-to-Text

This plugin provides interfaces to Windows streaming, Wit.ai non-streaming, Google streaming/non-streaming, and IBM Watson streaming/non-streaming speech-to-text. It also includes a sample scene that compares these APIs. This article on the Unity Labs website explains some of the concepts behind speech recognition and discusses the motivation behind this package.

Requirements
Setting up the sample scene
Recording and comparing results
Acquiring credentials
Architecture
Forks

Requirements

Matthew Schoen from Unity Labs has given us permission to include his JSON library in the package.
Windows and Google streaming speech-to-text will only work in Windows environments.
Watson streaming and non-streaming speech-to-text both rely on IBM's Watson SDK for Unity, which must be manually added to the project. The Unity Watson SDK can be found here.
Google non-streaming and Wit.ai non-streaming speech-to-text both rely on UniWeb, which must be manually added to the project. UniWeb can be found on the Unity Asset Store here.
Google streaming and non-streaming speech-to-text both rely on SoX (Sound eXchange), which must be manually added to the project. SoX can be found here. The SoX application must be located within Application.streamingAssetsPath/ThirdParty/SoX/Windows for Windows environments, and Application.streamingAssetsPath/ThirdParty/SoX/MacOSX otherwise.
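The SoX location rule above can be sketched as a small path helper. This is only an illustration, not the package's own lookup code: the class and method names are hypothetical, and the streaming assets path is passed in as a plain string (in a Unity project it would come from Application.streamingAssetsPath).

```csharp
using System.IO;

public static class SoXPathSketch
{
    // Hypothetical helper: builds the folder in which the SoX application is
    // expected to live, following the layout described in the requirements.
    // streamingAssetsPath would normally be Application.streamingAssetsPath.
    public static string GetSoXDirectory(string streamingAssetsPath, bool isWindows)
    {
        // "Windows" for Windows environments, "MacOSX" otherwise.
        string platformFolder = isWindows ? "Windows" : "MacOSX";

        // Nested two-argument Path.Combine calls also work on older .NET profiles.
        string thirdParty = Path.Combine(streamingAssetsPath, "ThirdParty");
        string sox = Path.Combine(thirdParty, "SoX");
        return Path.Combine(sox, platformFolder);
    }
}
```

Checking that this directory exists (for example with Directory.Exists) before starting a Google service can give a clearer error than a failed process launch.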

Setting up the sample scene

Open the scene "speechToTextComparison.unity".
Enter your credentials for each API by going through each child of "Canvas/SpeechToTextServiceWidgets" in the Inspector and changing the appropriate field(s) in the "[Specific Name Here] Speech To Text Service" component. Note that Google streaming speech-to-text uses a JSON credentials file, which must be saved under "GoogleStreamingSpeechToTextProgram" under Application.streamingAssetsPath, and whose name must match the "JSON Credentials File Name" field of the "Google Streaming Speech To Text Service" component of "Canvas/SpeechToTextServiceWidgets/GoogleStreamingSpeechToTextService". You will only receive transcriptions from APIs for which you have provided valid credentials (except Windows, which does not require any). See the "Acquiring credentials" section for instructions on acquiring credentials for each API.
Configure any parameters that you wish to change for each service (timeout, audio chunk length, etc.). Check the "Speech To Text Comparison Widget" component of "Canvas/SpeechToTextComparisonWidget" in the Inspector to make sure that all the services you wish to test are listed under "Speech To Text Service Widgets", and add/remove services from this list as needed.
The scene is now ready to run. Refer to Recording and comparing results for how the sample scene works.

Recording and comparing results

Real-time results will be displayed for streaming speech-to-text, and results for non-streaming speech-to-text will be displayed after you stop recording.
A single recording session times out after 15 seconds by default. To change this, select the "Singletons" game object, find its "Audio Recording Manager" component in the Inspector, and modify "Max Recording Length In Seconds".
After recording, if the application has not received all results within 10 seconds, it will stop listening for results. This can be changed by going to the "Speech To Text Comparison Widget" component of "Canvas/SpeechToTextComparisonWidget" in the Inspector and changing "Responses Timeout In Seconds".
If you have selected a sample phrase before you stop recording, then each end result will be compared against this sample phrase and the accuracy will be displayed.
If you have selected "Save results to file?", then text results at the end of each recording session will be saved to a .txt file in the Application.dataPath/SpeechToText folder. A new file will be used with each run of the application.
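The text above does not say how the accuracy against the sample phrase is computed. As one plausible illustration only, the sketch below scores a transcription by word-level edit distance relative to the sample phrase. The class name, the metric, and the lowercase/whitespace normalization are all assumptions, not the package's actual method.

```csharp
using System;

public static class AccuracySketch
{
    // Word-level Levenshtein distance: the minimum number of word insertions,
    // deletions, and substitutions needed to turn a into b.
    static int WordEditDistance(string[] a, string[] b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; ++i) d[i, 0] = i;
        for (int j = 0; j <= b.Length; ++j) d[0, j] = j;
        for (int i = 1; i <= a.Length; ++i)
        {
            for (int j = 1; j <= b.Length; ++j)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int deleteOrInsert = Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1);
                d[i, j] = Math.Min(deleteOrInsert, d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed normalization: lowercase, split on spaces, drop empty entries.
    static string[] Words(string s)
    {
        return s.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns a score in [0, 1]; 1 means the transcription matches the sample phrase.
    public static float Accuracy(string samplePhrase, string transcription)
    {
        string[] expected = Words(samplePhrase);
        string[] actual = Words(transcription);
        if (expected.Length == 0) return actual.Length == 0 ? 1f : 0f;
        int distance = WordEditDistance(expected, actual);
        return Math.Max(0f, 1f - (float)distance / expected.Length);
    }
}
```

With this metric, a transcription that gets one word wrong in a four-word phrase scores 0.75. Punctuation is not stripped here, so "fox." and "fox" count as different words.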

Acquiring credentials

Google Cloud Speech

Sign up for a Google Cloud Platform account.
Sign up for Google Cloud Speech API.
Once you have been granted access to Google Cloud Speech API, refer to the documentation for instructions on generating an API key (for non-streaming speech-to-text) and a JSON service account key (for streaming speech-to-text).

IBM Watson Speech to Text

Sign up for an IBM Bluemix account.
Sign up for IBM Watson Speech to Text.
Once you have been granted access to IBM Watson Speech to Text, refer to the documentation for instructions on generating service credentials.

Wit.ai

Sign up for a Wit.ai account.
Create a new app through the Wit.ai console.
Your server access token will be listed under your app settings.

Architecture

Namespaces

UnitySpeechToText includes all non-third-party scripts in the package.
UnitySpeechToText.Services includes all speech-to-text service and result scripts.
UnitySpeechToText.Widgets includes the speech-to-text widget scripts used in the sample scene.
UnitySpeechToText.Utilities includes all general utility scripts, such as AudioRecordingManager and DebugFlags.

Speech-to-text services and results inheritance hierarchy

MonoBehaviour
- SpeechToTextService
  - NonStreamingSpeechToTextService
    - GoogleNonStreamingSpeechToTextService
    - WatsonNonStreamingSpeechToTextService
    - WitAiNonStreamingSpeechToTextService
  - StreamingSpeechToTextService
    - GoogleStreamingSpeechToTextService
    - WatsonStreamingSpeechToTextService
  - WindowsSpeechToTextService
SpeechToTextResult
TextAlternative
- GoogleTextAlternative
- WatsonTextAlternative
- WindowsTextAlternative

Speech-to-text services and results base functions and properties

SpeechToTextService
- public bool IsRecording [get]
  - Whether the service is recording audio
- public void RegisterOnTextResult(Action<SpeechToTextResult> action)
  - Adds a function to the text result delegate.
- public void UnregisterOnTextResult(Action<SpeechToTextResult> action)
  - Removes a function from the text result delegate.
- public void RegisterOnError(Action<string> action)
  - Adds a function to the error delegate.
- public void UnregisterOnError(Action<string> action)
  - Removes a function from the error delegate.
- public void RegisterOnRecordingTimeout(Action action)
  - Adds a function to the recording timeout delegate.
- public void UnregisterOnRecordingTimeout(Action action)
  - Removes a function from the recording timeout delegate.
- public virtual bool StartRecording()
  - Starts recording audio if the service is not already recording.
  - Returns whether the service successfully started recording.
- public virtual bool StopRecording()
  - Stops recording audio if the service is already recording.
  - Returns whether the service successfully stopped recording.
- protected virtual void Start()
  - Initialization function called on the frame when the script is enabled just before any of the Update methods is called the first time.
- protected virtual void OnDestroy()
  - Function that is called when the MonoBehaviour will be destroyed.
- protected void OnRecordingTimeout()
  - Function that is called when the recording times out.
NonStreamingSpeechToTextService
- public override bool StartRecording()
  - Starts recording audio if the service is not already recording.
  - Returns whether the service successfully started recording.
- public override bool StopRecording()
  - Stops recording audio if the service is already recording.
  - Returns whether the service successfully stopped recording.
- private IEnumerator RecordAndTranslateToText()
  - Records audio and translates any speech to text.
- protected abstract IEnumerator TranslateRecordingToText()
  - Translates speech to text by making a request to the speech-to-text API.
StreamingSpeechToTextService
- public float SessionTimeoutAfterDoneRecording [set]
  - Number of seconds after recording to wait until the session times out
- public float AudioChunkLengthInSeconds [set]
  - Length (in seconds) of each chunk of recorded audio to send to the server
- public override bool StartRecording()
  - Starts recording audio if the service is not already recording.
  - Returns whether the service successfully started recording.
- public override bool StopRecording()
  - Stops recording audio if the service is already recording.
  - Returns whether the service successfully stopped recording.
- private IEnumerator RecordAudio()
  - Records audio and queues fixed audio chunks.
- protected abstract IEnumerator StreamAudioAndListenForResponses()
  - Sends queued chunks of audio to the server and then waits for the transcription(s).
SpeechToTextResult
- public bool IsFinal [get and set]
  - Whether this is a final (rather than interim) result
- public TextAlternative[] TextAlternatives [get and set]
  - Array of text transcription alternatives
- public SpeechToTextResult()
  - Default class constructor.
- public SpeechToTextResult(string text, bool isFinal)
  - Class constructor given a single string text alternative and whether the result is final.
TextAlternative
- public string Text [get and set]
  - The text transcription itself
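To make the result types listed above concrete, here is a minimal sketch of pulling a single string out of a result. The two classes are stand-ins that mirror only the documented members so that the snippet compiles on its own; in a real project the package's own SpeechToTextResult and TextAlternative would be used instead. Treating the first alternative as the preferred one is an assumption, and ResultHelpers is a hypothetical name.

```csharp
using System;

namespace SpeechToTextSketch
{
    // Stand-in mirroring the documented TextAlternative.Text property.
    public class TextAlternative
    {
        public string Text { get; set; }
    }

    // Stand-in mirroring the documented SpeechToTextResult members.
    public class SpeechToTextResult
    {
        public bool IsFinal { get; set; }
        public TextAlternative[] TextAlternatives { get; set; }

        public SpeechToTextResult() { }

        // Wraps a single string as the only text alternative.
        public SpeechToTextResult(string text, bool isFinal)
        {
            IsFinal = isFinal;
            TextAlternatives = new[] { new TextAlternative { Text = text } };
        }
    }

    public static class ResultHelpers
    {
        // Returns the first alternative's text for a final result,
        // or null for interim results and results without alternatives.
        public static string FinalTextOrNull(SpeechToTextResult result)
        {
            if (result == null || !result.IsFinal) return null;
            if (result.TextAlternatives == null || result.TextAlternatives.Length == 0) return null;
            return result.TextAlternatives[0].Text;
        }
    }
}
```

A helper like this could be called from a handler registered with RegisterOnTextResult, so that interim results from streaming services are ignored.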

AudioRecordingManager functions and properties

public int RecordingFrequency [set]
- Frequency (samples-per-second) at which to record
public int MaxRecordingLengthInSeconds [set]
- Number of seconds to record before the recording times out
public AudioClip RecordedAudio [get]
- Audio clip created from the most recent recording
public void RegisterOnTimeout(Action action)
- Adds a function to the timeout delegate.
public void UnregisterOnTimeout(Action action)
- Removes a function from the timeout delegate.
public bool IsRecording()
- Queries if the default device is currently recording.
- Returns whether the default device is currently recording.
private IEnumerator WaitForRecordingTimeout()
- Waits for the default device to stop recording and checks if this was due to a timeout.
public IEnumerator RecordAndWaitUntilDone()
- Starts a recording session and waits until it finishes.
public void StartRecording()
- Tells the default device to start recording if it is not already.
public void StopRecording()
- If the default device is recording, ends the recording session and trims the default audio clip produced.
public AudioClip GetChunkOfRecordedAudio(float offsetInSeconds, float chunkLengthInSeconds)
- Creates and returns a specific chunk of audio from the current recording.
- Returns the audio chunk or null if the chunk length is less than or equal to 0 or if the offset is greater than or equal to the recorded audio length.
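The chunk bounds that a streaming service might request through GetChunkOfRecordedAudio can be sketched with plain arithmetic. This is a hypothetical illustration: the class and method names are made up, and shortening the final chunk to fit the recording is an assumption rather than documented behavior. The guard conditions follow the null cases listed above (non-positive chunk length, offset at or past the end of the recording).

```csharp
using System;
using System.Collections.Generic;

public static class ChunkingSketch
{
    // Returns (offset, length) pairs in seconds that cover a recording of the
    // given length in fixed-size chunks. Key = offset, Value = chunk length.
    public static List<KeyValuePair<float, float>> ChunkBounds(float recordingLength, float chunkLength)
    {
        var bounds = new List<KeyValuePair<float, float>>();

        // A non-positive chunk length yields no chunks.
        if (chunkLength <= 0f) return bounds;

        // Stop as soon as the offset reaches the end of the recording.
        for (int i = 0; i * chunkLength < recordingLength; ++i)
        {
            float offset = i * chunkLength;
            // Assumption: the last chunk is shortened to fit the recording.
            float length = Math.Min(chunkLength, recordingLength - offset);
            bounds.Add(new KeyValuePair<float, float>(offset, length));
        }
        return bounds;
    }
}
```

For a 2.5-second recording and 1-second chunks this produces three chunks, the last one 0.5 seconds long.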

SmartLogger and DebugFlags

SmartLogger is a wrapper for the UnityEngine.Debug logger that can be used to only log debug messages when explicitly specified given a debug flag.
DebugFlags contains static flags that can be passed to the SmartLogger functions. To create your own flag, simply add a public static DebugFlag member variable to this class and construct it with your desired flag name and boolean value.
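The flag-gated logging pattern described above might look like the following. Everything here is a self-contained stand-in: the real DebugFlag constructor and the SmartLogger method names are not given in this document, so the member names below are assumptions, and Console.WriteLine stands in for the UnityEngine.Debug logger.

```csharp
using System;

namespace SmartLoggerSketch
{
    // Stand-in for a debug flag: a name plus an on/off value.
    public class DebugFlag
    {
        public string Name { get; private set; }
        public bool IsEnabled { get; set; }

        public DebugFlag(string name, bool isEnabled)
        {
            Name = name;
            IsEnabled = isEnabled;
        }
    }

    // Stand-in for the flags class: custom flags are added as public static members.
    public static class DebugFlags
    {
        public static DebugFlag MyFeature = new DebugFlag("MyFeature", true);
        public static DebugFlag NoisySubsystem = new DebugFlag("NoisySubsystem", false);
    }

    // Stand-in for the logger wrapper.
    public static class FlaggedLogger
    {
        // Writes the message only when the flag is enabled; returns whether it was written.
        public static bool Log(DebugFlag flag, string message)
        {
            if (flag == null || !flag.IsEnabled) return false;
            Console.WriteLine("[" + flag.Name + "] " + message);
            return true;
        }
    }
}
```

With these stand-ins, FlaggedLogger.Log(DebugFlags.NoisySubsystem, "...") writes nothing until the flag's value is switched to true.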

Example of speech-to-text service usage

Create a game object with a "[Specific Name Here] Speech To Text Service" component, and fill in all necessary fields in the Inspector. (Of course, you may instead choose to do all this programmatically.)
Assign a reference to [SpecificNameHere]SpeechToTextService (in this example we will call the reference m_SpeechToTextService) to the script that will be interacting with [SpecificNameHere]SpeechToTextService.
Add the following functions to your script.

void OnError(string text)
{
    Debug.LogError(text);
}

// Note that handling interim results is only necessary if your speech-to-text service is streaming.
// Non-streaming speech-to-text services should only return one result per recording.
void OnTextResult(SpeechToTextResult result)
{
    if (result.IsFinal)
    {
        Debug.Log("Final result:");
    }
    else
    {
        Debug.Log("Interim result:");
    }
    for (int i = 0; i < result.TextAlternatives.Length; ++i)
    {
        Debug.Log("Alternative " + i + ": " + result.TextAlternatives[i].Text);
    }
}

void OnRecordingTimeout()
{
    Debug.Log("Timeout");
}

Add the following code in a place that will be guaranteed to execute before you call m_SpeechToTextService.StartRecording(). (Most of the time this code should just be in either Start() or Awake(), assuming your script inherits from MonoBehaviour.)

m_SpeechToTextService.RegisterOnError(OnError);
m_SpeechToTextService.RegisterOnTextResult(OnTextResult);
m_SpeechToTextService.RegisterOnRecordingTimeout(OnRecordingTimeout);

(Optional) If at some point in execution you want to stop handling results within this script, add the following code to that place. (A good example is within OnDestroy().)

m_SpeechToTextService.UnregisterOnError(OnError);
m_SpeechToTextService.UnregisterOnTextResult(OnTextResult);
m_SpeechToTextService.UnregisterOnRecordingTimeout(OnRecordingTimeout);

Create a hook for when you want to start recording, and in it add m_SpeechToTextService.StartRecording(). Create a hook for when you want to stop recording, and in it add m_SpeechToTextService.StopRecording(). These hooks could be functions called upon button presses, for example.
(Optional) If the specific speech-to-text service you choose uses AudioRecordingManager to record audio, you can assign the public properties of AudioRecordingManager in a place that will be guaranteed to execute before you call m_SpeechToTextService.StartRecording(). This can be done from the Inspector if you create a game object with an "Audio Recording Manager" component, or from script by using AudioRecordingManager.Instance. (In the latter case, do not worry about creating a game object with this component - the MonoSingleton implementation will take care of that for you.)
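The start/stop hooks and the optional AudioRecordingManager setup from the last two steps could be combined in one script, sketched below. It is meant to run inside a Unity project that contains this package and is not a definitive implementation: RecordToggleExample and OnRecordButtonPressed are hypothetical names, the 20-second limit is an arbitrary example value, and the namespaces are taken from the Namespaces section.

```csharp
using UnityEngine;
using UnitySpeechToText.Services;
using UnitySpeechToText.Utilities;

public class RecordToggleExample : MonoBehaviour
{
    // Assign a concrete service component (for example the Windows one) in the Inspector.
    [SerializeField]
    SpeechToTextService m_SpeechToTextService;

    void Start()
    {
        // Optional, and only relevant if the chosen service records through
        // AudioRecordingManager: raise the recording limit before recording starts.
        AudioRecordingManager.Instance.MaxRecordingLengthInSeconds = 20;

        m_SpeechToTextService.RegisterOnError(OnError);
        m_SpeechToTextService.RegisterOnTextResult(OnTextResult);
    }

    void OnDestroy()
    {
        m_SpeechToTextService.UnregisterOnError(OnError);
        m_SpeechToTextService.UnregisterOnTextResult(OnTextResult);
    }

    // Hook: call this from a UI button's click event to toggle recording.
    public void OnRecordButtonPressed()
    {
        if (m_SpeechToTextService.IsRecording)
        {
            m_SpeechToTextService.StopRecording();
        }
        else if (!m_SpeechToTextService.StartRecording())
        {
            Debug.LogWarning("The service could not start recording.");
        }
    }

    void OnError(string text)
    {
        Debug.LogError(text);
    }

    void OnTextResult(SpeechToTextResult result)
    {
        // Only final results are logged here; interim results are ignored.
        if (result.IsFinal && result.TextAlternatives.Length > 0)
        {
            Debug.Log(result.TextAlternatives[0].Text);
        }
    }
}
```

A single toggle method keeps the button wiring simple; separate start and stop methods would work just as well if the UI has two buttons.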

Forks

The BitBucket repository for this project can be found here. Anyone in the community is welcome to create their own forks. Drop us a note at labs@unity3d.com if you find it useful; we'd love to hear from you!

Last edited: 2017.12.04 04:06:35

