Commit dfdb98d

feat: Android 14+ language detection (#36)
* feat: android language detection
* register events
* feat: avoid re-sending language detection event if last detected language or confidence don't change
* chore: update docs for language detection
1 parent 131d0ce commit dfdb98d

6 files changed, +118 -20 lines changed

README.md (+51 -16)

@@ -24,6 +24,7 @@
 - [Polyfilling the Web SpeechRecognition API](#polyfilling-the-web-speechrecognition-api)
 - [Muting the beep sound on Android](#muting-the-beep-sound-on-android)
 - [Improving accuracy of single-word prompts](#improving-accuracy-of-single-word-prompts)
+- [Language Detection](#language-detection)
 - [Platform Compatibility Table](#platform-compatibility-table)
 - [Common Troubleshooting issues](#common-troubleshooting-issues)
 - [Android issues](#android-issues)
@@ -36,14 +37,14 @@
 - [getPermissionsAsync()](#getpermissionsasync-promisepermissionresponse)
 - [getStateAsync()](#getstateasync-promisespeechrecognitionstate)
 - [addSpeechRecognitionListener()](#addspeechrecognitionlistener)
-- [getSupportedLocales()](#getsupportedlocales-promise-locales-string-installedlocales-string-)
+- [getSupportedLocales()](#getsupportedlocales)
 - [getSpeechRecognitionServices()](#getspeechrecognitionservices-string-android-only)
 - [getDefaultRecognitionService()](#getdefaultrecognitionservice--packagename-string--android-only)
 - [getAssistantService()](#getassistantservice--packagename-string--android-only)
 - [isRecognitionAvailable()](#isrecognitionavailable-boolean)
 - [supportsOnDeviceRecognition()](#supportsondevicerecognition-boolean)
 - [supportsRecording()](#supportsrecording-boolean)
-- [androidTriggerOfflineModelDownload()](#androidtriggerofflinemodeldownload-locale-string--promise-status-opened_dialog--download_success--download_canceled-message-string-)
+- [androidTriggerOfflineModelDownload()](#androidtriggerofflinemodeldownload)
 - [setCategoryIOS()](#setcategoryios-void-ios-only)
 - [getAudioSessionCategoryAndOptionsIOS()](#getaudiosessioncategoryandoptionsios-ios-only)
 - [setAudioSessionActiveIOS()](#setaudiosessionactiveiosvalue-boolean-options--notifyothersondeactivation-boolean--void)
@@ -322,18 +323,19 @@
 Events are largely based on the [Web Speech API](https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition). The following events are supported:

 | Event Name | Description | Notes |
 | --- | --- | --- |
 | `audiostart` | Audio capturing has started | Includes the `uri` if `recordingOptions.persist` is enabled. |
 | `audioend` | Audio capturing has ended | Includes the `uri` if `recordingOptions.persist` is enabled. |
 | `end` | Speech recognition service has disconnected. | This should always be the last event dispatched, including after errors. |
 | `error` | Fired when a speech recognition error occurs. | You'll also receive an `error` event (with code "aborted") when calling `.abort()` |
 | `nomatch` | Speech recognition service returns a final result with no significant recognition. | You may have non-final results recognized. This may get emitted after cancellation. |
 | `result` | Speech recognition service returns a word or phrase that has been positively recognized. | On Android, continuous mode runs as a segmented session, meaning when a final result is reached, additional partial and final results will cover a new segment separate from the previous final result. On iOS, you should expect one final result before speech recognition has stopped. |
 | `speechstart` | Fired when any sound — recognizable speech or not — has been detected | On iOS, this will fire once in the session after a result has occurred |
 | `speechend` | Fired when speech recognized by the speech recognition service has stopped being detected. | Not supported yet on iOS |
 | `start` | Speech recognition has started | Use this event to indicate to the user when to speak. |
 | `volumechange` | Fired when the input volume changes. | Returns a value between -2 and 10 indicating the volume of the input audio. Consider anything below 0 to be inaudible. |
+| `languagedetection` | Called when the language detection (and switching) results are available. | Android 14+ only with `com.google.android.as`. Enabled with `EXTRA_ENABLE_LANGUAGE_DETECTION` in the `androidIntentOptions` when starting. It can also fire multiple times when `EXTRA_ENABLE_LANGUAGE_SWITCH` is enabled. |

 ## Handling Errors

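The segmented-session behavior noted for the `result` event means a continuous Android session can emit several final results, one per segment. A minimal sketch of tallying them into a single transcript (it mirrors the pattern used in example/App.tsx below; the event shape and `addSpeechRecognitionListener` are documented in the README):

```ts
import { addSpeechRecognitionListener } from "expo-speech-recognition";

// Each final result closes out a segment, so append it to a running
// tally; non-final results only preview the segment currently spoken.
let tally = "";

const listener = addSpeechRecognitionListener("result", (event) => {
  const transcript = event.results[0]?.transcript ?? "";
  console.log("Transcript so far:", tally + transcript);
  if (event.isFinal) {
    tally += transcript;
  }
});

// Later, when the session is done:
// listener.remove();
```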
@@ -696,6 +698,39 @@
 - For both platforms, you also may want to consider using on-device recognition. On Android this seems to work well for single-word prompts.
 - Alternatively, you may want to consider recording the recognized audio and sending it to an external service for further processing. See [Persisting Audio Recordings](#persisting-audio-recordings) for more information. Note that some services (such as the Google Speech API) may require an audio file with a duration of at least 3 seconds.

+## Language Detection
+
+> [!NOTE]
+> This feature is currently only available on Android 14+ using the `com.google.android.as` service package.
+
+You can use the `languagedetection` event to get the detected language and confidence level. This feature has a few requirements:
+
+- Android 14+ only.
+- The `com.google.android.as` (on-device recognition) service package must be selected. At the time of writing, this appears to be the only service that supports language detection.
+- You must enable `EXTRA_ENABLE_LANGUAGE_DETECTION` in the `androidIntentOptions` when starting the recognition.
+- Optional: you can enable `EXTRA_ENABLE_LANGUAGE_SWITCH` to allow the user to switch languages; however, **keep in mind that the language model needs to be downloaded for this to work**. Refer to [androidTriggerOfflineModelDownload()](#androidtriggerofflinemodeldownload) to download a model, and [getSupportedLocales()](#getsupportedlocales) to get the list of downloaded on-device locales.
+
+Example:
+
+```tsx
+import {
+  ExpoSpeechRecognitionModule,
+  useSpeechRecognitionEvent,
+} from "expo-speech-recognition";
+
+useSpeechRecognitionEvent("languagedetection", (event) => {
+  console.log("Language detected:", event.detectedLanguage); // e.g. "en-us"
+  console.log("Confidence:", event.confidence); // A value between 0.0 and 1.0
+  console.log("Top locale alternatives:", event.topLocaleAlternatives); // e.g. ["en-au", "en-gb"]
+});
+
+// Start recognition
+ExpoSpeechRecognitionModule.start({
+  androidIntentOptions: {
+    EXTRA_ENABLE_LANGUAGE_DETECTION: true,
+    EXTRA_ENABLE_LANGUAGE_SWITCH: true,
+  },
+  androidRecognitionServicePackage: "com.google.android.as", // or set "requiresOnDeviceRecognition" to true
+});
+```
+
 ## Platform Compatibility Table

 As of 7 Aug 2024, the following platforms are supported:
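Since `EXTRA_ENABLE_LANGUAGE_SWITCH` only works for locales whose offline models are installed, a pre-flight check before starting could look like the sketch below. It assumes `getSupportedLocales()` and `androidTriggerOfflineModelDownload()` are importable as standalone functions (as `supportsRecording()` is used elsewhere in the README); their return shapes are taken from the original section headings.

```ts
import {
  ExpoSpeechRecognitionModule,
  getSupportedLocales,
  androidTriggerOfflineModelDownload,
} from "expo-speech-recognition";

// Ensure the on-device model for `locale` is installed, then start a
// session with language detection and switching enabled.
async function startWithLanguageSwitching(locale: string) {
  const { installedLocales } = await getSupportedLocales();

  if (!installedLocales.includes(locale)) {
    const { status } = await androidTriggerOfflineModelDownload({ locale });
    if (status === "download_canceled") {
      return; // user backed out of the download dialog
    }
  }

  ExpoSpeechRecognitionModule.start({
    lang: locale,
    androidRecognitionServicePackage: "com.google.android.as",
    androidIntentOptions: {
      EXTRA_ENABLE_LANGUAGE_DETECTION: true,
      EXTRA_ENABLE_LANGUAGE_SWITCH: true,
    },
  });
}
```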
@@ -852,7 +887,7 @@
 listener.remove();
 ```

-### `getSupportedLocales(): Promise<{ locales: string[]; installedLocales: string[] }>`
+### `getSupportedLocales()`

 > [!NOTE]
 > Not supported on Android 12 and below
@@ -966,7 +1001,7 @@
 console.log("Recording available:", available);
 ```

-### `androidTriggerOfflineModelDownload({ locale: string }): Promise<{ status: "opened_dialog" | "download_success" | "download_canceled", message: string }>`
+### `androidTriggerOfflineModelDownload()`

 Users on Android devices will first need to download the offline model for the locale they want to use in order to use on-device speech recognition (i.e. the `requiresOnDeviceRecognition` setting in the `start` options).
android/src/main/java/expo/modules/speechrecognition/ExpoSpeechRecognitionModule.kt (+2 -0)

@@ -86,6 +86,8 @@
       "start",
       // Called when there's results (as a string array, not API compliant)
       "results",
+      // Called when the language detection (and switching) results are available.
+      "languagedetection",
       // Fired when the input volume changes
       "volumechange",
     )

android/src/main/java/expo/modules/speechrecognition/ExpoSpeechService.kt (+35 -1)

@@ -60,6 +60,9 @@
     private var delayedFileStreamer: DelayedFileStreamer? = null
     private var soundState = SoundState.INACTIVE

+    private var lastDetectedLanguage: String? = null
+    private var lastLanguageConfidence: Float? = null
+
     var recognitionState = RecognitionState.INACTIVE

     companion object {

@@ -121,6 +124,8 @@
         audioRecorder = null
         delayedFileStreamer?.close()
        delayedFileStreamer = null
+        lastDetectedLanguage = null
+        lastLanguageConfidence = null
         recognitionState = RecognitionState.STARTING
         soundState = SoundState.INACTIVE
         lastVolumeChangeEventTime = 0L

@@ -435,7 +440,7 @@
         when {
             // File URI
             sourceUri.startsWith("file://") -> File(URI(sourceUri))
-
+
             // Local file path without URI scheme
             !sourceUri.startsWith("https://") -> File(sourceUri)
@@ -581,6 +586,15 @@
             else -> 0.0f
         }

+    private fun languageDetectionConfidenceLevelToFloat(confidenceLevel: Int): Float =
+        when (confidenceLevel) {
+            SpeechRecognizer.LANGUAGE_DETECTION_CONFIDENCE_LEVEL_HIGHLY_CONFIDENT -> 1.0f
+            SpeechRecognizer.LANGUAGE_DETECTION_CONFIDENCE_LEVEL_CONFIDENT -> 0.8f
+            SpeechRecognizer.LANGUAGE_DETECTION_CONFIDENCE_LEVEL_NOT_CONFIDENT -> 0.5f
+            SpeechRecognizer.LANGUAGE_DETECTION_CONFIDENCE_LEVEL_UNKNOWN -> 0f
+            else -> 0.0f
+        }
+
     override fun onResults(results: Bundle?) {
         val resultsList = getResults(results)

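On the JS side, `confidence` therefore arrives as one of the discrete float values above rather than a continuous score. A small sketch of mapping it back to a display label (the labels and thresholds are assumptions mirroring the Kotlin mapping above, not part of the library API):

```ts
// Mirror the native confidence-level-to-float mapping back into labels.
type ConfidenceLabel =
  | "highly confident"
  | "confident"
  | "not confident"
  | "unknown";

function labelConfidence(confidence: number): ConfidenceLabel {
  if (confidence >= 1.0) return "highly confident";
  if (confidence >= 0.8) return "confident";
  if (confidence >= 0.5) return "not confident";
  return "unknown";
}
```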
@@ -614,6 +628,26 @@
         }
     }

+    override fun onLanguageDetection(results: Bundle) {
+        val detectedLanguage = results.getString(SpeechRecognizer.DETECTED_LANGUAGE)
+        val confidence = languageDetectionConfidenceLevelToFloat(results.getInt(SpeechRecognizer.LANGUAGE_DETECTION_CONFIDENCE_LEVEL))
+
+        // Only send the event if the language or confidence has changed
+        if (detectedLanguage != lastDetectedLanguage || confidence != lastLanguageConfidence) {
+            lastDetectedLanguage = detectedLanguage
+            lastLanguageConfidence = confidence
+
+            sendEvent(
+                "languagedetection",
+                mapOf(
+                    "detectedLanguage" to detectedLanguage,
+                    "confidence" to confidence,
+                    "topLocaleAlternatives" to results.getStringArrayList(SpeechRecognizer.TOP_LOCALE_ALTERNATIVES),
+                ),
+            )
+        }
+    }
+
     /**
      * For API 33: Basically same as onResults but doesn't stop
      */
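Because the native layer suppresses duplicate events, every `languagedetection` event that reaches JS represents an actual change in language or confidence. A sketch of reacting to language switches (the state handling here is illustrative, not part of the library):

```ts
import { addSpeechRecognitionListener } from "expo-speech-recognition";

// No deduplication is needed here: the native side only emits when
// the detected language or confidence changes.
let currentLanguage: string | null = null;

const subscription = addSpeechRecognitionListener(
  "languagedetection",
  (event) => {
    if (event.detectedLanguage !== currentLanguage) {
      currentLanguage = event.detectedLanguage;
      console.log("Speaker switched to:", currentLanguage);
    }
  },
);

// Call subscription.remove() when the session ends.
```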

example/App.tsx (+10 -3)

@@ -93,8 +93,6 @@
       const transcript = ev.results[0]?.transcript || "";

       setTranscription((current) => {
-        // When a final result comes in, we need to update the base transcript to build off from
-        // Because on Android and Web, multiple final results can be returned within a continuous session
         // When a final result is received, any following recognized transcripts will omit the previous final result
         const transcriptTally = ev.isFinal
           ? (current?.transcriptTally ?? "") + transcript

@@ -126,6 +124,10 @@
     console.log("[event]: nomatch");
   });

+  useSpeechRecognitionEvent("languagedetection", (ev) => {
+    console.log("[event]: languagedetection", ev);
+  });
+
   const startListening = () => {
     if (status !== "idle") {
       return;

@@ -574,7 +576,12 @@
                   : locale
               }
               active={settings.lang === locale}
-              onPress={() => handleChange("lang", locale)}
+              onPress={() =>
+                handleChange(
+                  "lang",
+                  settings.lang === locale ? undefined : locale,
+                )
+              }
             />
           );
         })}

ios/ExpoSpeechRecognitionModule.swift (+2 -0)

@@ -94,6 +94,8 @@
       "start",
       // Called when there's results (as a string array, not API compliant)
       "result",
+      // Called when the language detection (and switching) results are available.
+      "languagedetection",
       // Fired when the input volume changes
       "volumechange"
     )
