# Initial commit: Add task audio-text-to-text #1212

Draft · wants to merge 5 commits into `main`
`packages/tasks/src/tasks/audio-text-to-text/about.md` (30 additions, 0 deletions)
## Different Types of Audio-Text-to-Text Models

Audio-text-to-text models can be categorized into two main types:

- **Base:**
Pre-trained models that extract rich audio features using techniques such as Wav2Vec, HuBERT, or Whisper. These models serve as the backbone for various downstream tasks. An example is [Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B), which can be further fine-tuned.

- **Instruction:**
Base models fine-tuned on specialized audio instruction datasets to better handle task-specific queries and conversations. For instance, [Ichigo-llama3.1-s-instruct-v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4) has been optimized to follow detailed audio-related commands; a minimal request sketch follows this list.
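
To make the instruction-tuned case concrete, here is a minimal sketch of sending audio plus a text prompt to such a model over an OpenAI-compatible chat API. The endpoint URL, the `input_audio` content part, and support for this particular checkpoint are illustrative assumptions, not something this task page defines:

```ts
// Minimal sketch (not part of this PR): query an instruction-tuned
// audio-text-to-text model through an OpenAI-compatible chat endpoint.
// ASSUMPTIONS: the endpoint URL, the `input_audio` content part, and provider
// support for this checkpoint are illustrative only.
import { readFileSync } from "node:fs";

// Base64-encode a local audio clip so it can travel inside a JSON payload.
const audioBase64 = readFileSync("sample-audio.wav").toString("base64");

const response = await fetch("https://router.huggingface.co/v1/chat/completions", {
	method: "POST",
	headers: {
		Authorization: `Bearer ${process.env.HF_TOKEN}`,
		"Content-Type": "application/json",
	},
	body: JSON.stringify({
		model: "homebrewltd/Ichigo-llama3.1-s-instruct-v0.4", // any instruction-tuned checkpoint
		messages: [
			{
				role: "user",
				content: [
					{ type: "input_audio", input_audio: { data: audioBase64, format: "wav" } },
					{ type: "text", text: "Transcribe and describe what is being said in the audio." },
				],
			},
		],
	}),
});

const result = await response.json();
console.log(result.choices[0].message.content);
```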

### Use Cases
> **Review comment:** I'd give examples for all of these use-cases!


- **Multimodal Audio Dialogue:**
These models can engage in real-time, multi-turn conversations by processing audio inputs and generating text responses. They are the backbone of advanced voice assistants and interactive dialogue systems (see the message-structure sketch after this list).

- **Speech Transcription and Analysis:**
Beyond converting spoken words to text, these models capture prosody, emotion, and speaker characteristics. This enriched transcription can be used for applications such as sentiment analysis and speaker profiling.

- **Audio Question Answering:**
By directly processing audio inputs, the models can answer questions about the content of an audio clip—whether it’s a podcast excerpt or a recorded conversation.

- **Audio Command Recognition and Automation:**
Voice-controlled applications, from smart home devices to computer interfaces, benefit from models that can understand and execute complex spoken commands.

- **Voice-Based Computer Use:**
Models can control computing workflows by parsing spoken instructions, making interactions more natural and accessible.
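
As referenced in the first bullet above, the sketch below shows one way a multi-turn audio dialogue can be represented as an OpenAI-style message history. The exact content-part schema (here, `input_audio`) is an assumption and varies by provider:

```ts
// Minimal sketch of a multi-turn audio dialogue history in an OpenAI-style
// message format. ASSUMPTION: the `input_audio` content-part schema; providers differ.
type TextPart = { type: "text"; text: string };
type AudioPart = { type: "input_audio"; input_audio: { data: string; format: "wav" } };

interface Message {
	role: "user" | "assistant";
	content: string | (TextPart | AudioPart)[];
}

const history: Message[] = [
	{
		role: "user",
		content: [
			{ type: "input_audio", input_audio: { data: "<base64-encoded wav>", format: "wav" } },
			{ type: "text", text: "What is the speaker asking for?" },
		],
	},
	{ role: "assistant", content: "The speaker is asking for directions to the train station." },
	// Later turns reuse the same structure, so the earlier audio stays in context.
	{ role: "user", content: "Answer them politely in one sentence." },
];

console.log(JSON.stringify(history, null, 2));
```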


### Useful Resources
> **Review comment:** You can put ultravox, ichigo, ultravox as resources here


`packages/tasks/src/tasks/audio-text-to-text/data.ts` (60 additions, 0 deletions)
import type { TaskDataCustom } from "../index.js";

// Task metadata rendered on the Hub's audio-text-to-text task page; the shape is
// defined by the TaskDataCustom type imported above.
const taskData: TaskDataCustom = {
	datasets: [
		{
			description: "Instructions composed of audio and text.",
			id: "homebrewltd/instruction-speech-encodec-v1.5",
		},
	],
	demo: {
		inputs: [
			{
				filename: "sample-audio.wav",
				type: "audio",
			},
			{
				label: "Text Prompt",
				content: "Transcribe and describe what is being said in the audio.",
				type: "text",
			},
		],
		outputs: [
			{
				label: "Answer",
				content:
					"The audio contains a person explaining a recipe for chocolate chip cookies. They describe mixing butter and sugar first, then adding eggs and vanilla extract, followed by the dry ingredients.",
				type: "text",
			},
		],
	},
	metrics: [],
	models: [
		{
			description: "Small yet powerful audio language model.",
			id: "fixie-ai/ultravox-v0_5-llama-3_2-1b",
		},
		{
			description: "Audio language model based on Llama 3.1 8B.",
			id: "homebrewltd/Ichigo-llama3.1-s-instruct-v0.4",
		},
		{
			description: "Strong audio language model.",
			id: "Qwen/Qwen2-Audio-7B",
		},
	],
	spaces: [
		{
			description: "Powerful audio-language model assistant.",
			id: "Qwen/Qwen2-Audio-Instruct-Demo",
		},
		{
			description: "Real-time audio-text-to-text model.",
			id: "Steveeeeeeen/talk-to-ultravox-0.5",
		},
	],
	summary:
		"Audio-text-to-text models extend multimodal AI into the speech domain. Much like their visual counterparts, these models are designed to understand and generate text based on audio inputs. Recent research in spoken dialogue systems and speech large language models (LLMs) highlights how such models are evolving, leveraging both semantic and acoustic representations extracted from speech signals.",
	widgetModels: [],
};

export default taskData;
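
Not part of this diff, but the exported object can be consumed like any other module. A minimal sketch, assuming the fields shown above:

```ts
// Minimal consumer sketch (not part of this PR): print the task summary and the
// recommended models from the metadata exported above.
import taskData from "./data.js";

console.log(taskData.summary);
for (const model of taskData.models ?? []) {
	console.log(`- ${model.id}: ${model.description}`);
}
```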