
Multimodal UX - Audio Component #1112

Merged
35 commits merged on Feb 14, 2025
Commits
2c47707
Fix for environment detection in barebones environments.
nopdive Nov 11, 2024
11c60b1
Minor clean-up of env detection.
nopdive Nov 11, 2024
d721a5f
Merge branch 'guidance-ai:main' into main
nopdive Nov 14, 2024
a958ac6
Merge branch 'guidance-ai:main' into main
nopdive Dec 10, 2024
428b81c
Merge branch 'guidance-ai:main' into main
nopdive Dec 11, 2024
4aa9913
Merge branch 'guidance-ai:main' into main
nopdive Jan 22, 2025
6cd2ba0
Audio and video API primitives added.
nopdive Jan 22, 2025
3153368
Refactor of byte parsing for multi-modal primitives.
nopdive Jan 22, 2025
8972b04
Added additional API primitives for modal outputs.
nopdive Jan 22, 2025
45ecde5
Added gen multi modal primitives to guidance top-level API.
nopdive Jan 22, 2025
b813d6d
Trace nodes updated to handle audio and video.
nopdive Jan 22, 2025
dffec01
Connecting audio/video to model class.
nopdive Jan 23, 2025
0dad4b6
Base64 encoding for modal messages.
nopdive Jan 23, 2025
7338f04
Minor additions.
nopdive Jan 23, 2025
94ebcf4
Added interfaces to client for modal.
nopdive Jan 23, 2025
abd54fa
Updated manifest for sample assets.
nopdive Jan 23, 2025
5557fe0
Audio / video added to app for client.
nopdive Jan 23, 2025
aee2656
Code reformat on some client files.
nopdive Jan 23, 2025
0d054f4
Added bundle.
nopdive Jan 23, 2025
c0df8d1
hacky initial audio prototype
nking-1 Jan 31, 2025
48c3e7b
Custom audio widget draft - trying to draw waveform
nking-1 Feb 4, 2025
cbf2dca
Waveform height bars fixed
nking-1 Feb 4, 2025
594d886
Added image mock via guidance library.
nopdive Feb 5, 2025
30163f0
Merge branch 'multimodal-surfaces' of https://github.com/guidance-ai/…
nopdive Feb 5, 2025
b087d2f
Fix closing bracket & format
nking-1 Feb 5, 2025
487a507
New bundle
nking-1 Feb 5, 2025
5330838
New line for audio widget
nking-1 Feb 6, 2025
5f9fb17
Added missing sample image.
nopdive Feb 6, 2025
1c2d1a9
Merge branch 'multimodal-surfaces' of https://github.com/guidance-ai/…
nopdive Feb 6, 2025
e467879
Fix some video output and image output pipeline bugs
nking-1 Feb 11, 2025
94c0588
video and image rendering on front end (no styling or controls yet)
nking-1 Feb 12, 2025
a54a143
Merge branch 'main' into multimodal-surfaces
nking-1 Feb 13, 2025
08f5290
Fix merge regressions
nking-1 Feb 13, 2025
2df401a
Rewrite image tests as pseudocode placeholders
nking-1 Feb 13, 2025
93e69a4
Try adding setuptools install dependency to fix build
nking-1 Feb 14, 2025
5 changes: 4 additions & 1 deletion MANIFEST.in
@@ -1 +1,4 @@
include resources/graphpaper-inline.html
include resources/graphpaper-inline.html
include resources/sample_audio.wav
include resources/sample_video.mp4
include resources/sample_image.png
14 changes: 13 additions & 1 deletion client/graphpaper-inline/src/App.svelte
@@ -11,16 +11,19 @@ For upcoming features, we won't be able to send all details over the wire, and w
clientmsg,
type GenTokenExtra,
type GuidanceMessage,
isAudioOutput,
isClientReadyAckMessage,
isExecutionCompletedMessage,
isExecutionStartedMessage,
isImageOutput,
isMetricMessage,
isResetDisplayMessage,
isRoleCloserInput,
isRoleOpenerInput,
isTextOutput,
isTokensMessage,
isTraceMessage,
isVideoOutput,
kernelmsg,
type NodeAttr,
state,
@@ -75,6 +78,15 @@ For upcoming features, we won't be able to send all details over the wire, and w
appState.textComponents.push(msg.node_attr);
} else if (isRoleCloserInput(msg.node_attr)) {
appState.textComponents.push(msg.node_attr);
} else if (isAudioOutput(msg.node_attr)) {
console.log("Audio available")
appState.textComponents.push(msg.node_attr);
} else if (isImageOutput(msg.node_attr)) {
console.log("Image available")
appState.textComponents.push(msg.node_attr);
} else if (isVideoOutput(msg.node_attr)) {
console.log("Video available")
appState.textComponents.push(msg.node_attr);
}
} else if (isExecutionStartedMessage(msg)) {
appState.requireFullReplay = false;
@@ -212,4 +224,4 @@ For upcoming features, we won't be able to send all details over the wire, and w
isError={appState.status === Status.Error}
bgField={bgField} underlineField={underlineField} requireFullReplay="{appState.requireFullReplay}" />
</section>
</div>
</div>
206 changes: 206 additions & 0 deletions client/graphpaper-inline/src/CustomAudio.svelte
@@ -0,0 +1,206 @@
<script>
import { onMount } from "svelte";
export let audioData; // Base64 data (without the data URL header)

let audio;
let isPlaying = false;
let progress = 0;
let duration = 0;
let currentTime = 0;
let volume = 1;
let waveformCanvas;

function togglePlay() {
if (audio.paused) {
audio.play();
isPlaying = true;
} else {
audio.pause();
isPlaying = false;
}
}

function updateProgress() {
if (audio) {
progress = (audio.currentTime / audio.duration) * 100;
currentTime = audio.currentTime;
duration = audio.duration || 0;
}
}

function seek(event) {
const seekBar = event.currentTarget;
const seekPosition = (event.offsetX / seekBar.offsetWidth) * audio.duration;
audio.currentTime = seekPosition;
}

function changeVolume(event) {
volume = event.target.value;
audio.volume = volume;
}

function formatTime(seconds) {
const min = Math.floor(seconds / 60);
const sec = Math.floor(seconds % 60);
return `${min}:${sec < 10 ? "0" : ""}${sec}`;
}

// When audio finishes, reset the play state
function handleEnded() {
isPlaying = false;
progress = 0;
currentTime = 0;
}

// Helper: convert base64 string to ArrayBuffer
function base64ToArrayBuffer(base64) {
const binaryString = atob(base64);
const len = binaryString.length;
const bytes = new Uint8Array(len);
for (let i = 0; i < len; i++) {
bytes[i] = binaryString.charCodeAt(i);
}
return bytes.buffer;
}

// Decode the audio, downsample it, and draw the waveform onto the canvas.
async function drawWaveform() {
if (!audioData || !waveformCanvas) return;
const audioContext = new AudioContext();
const arrayBuffer = base64ToArrayBuffer(audioData);
try {
const decodedData = await audioContext.decodeAudioData(arrayBuffer);
const rawData = decodedData.getChannelData(0); // use first channel

// Ensure the canvas has the proper pixel dimensions.
const canvas = waveformCanvas;
canvas.width = canvas.clientWidth;
canvas.height = canvas.clientHeight;
const width = canvas.width;
const height = canvas.height;

// Downsample the raw data to one value per pixel.
const samples = width;
const blockSize = Math.floor(rawData.length / samples);
const waveform = new Array(samples);
for (let i = 0; i < samples; i++) {
let sum = 0;
for (let j = 0; j < blockSize; j++) {
sum += Math.abs(rawData[i * blockSize + j]);
}
waveform[i] = sum / blockSize;
}

// Normalize the waveform data so that the maximum amplitude maps to the full canvas height.
const maxAmp = Math.max(...waveform);
// Prevent division by zero
const scale = maxAmp > 0 ? 1 / maxAmp : 1;

// Draw the waveform: each pixel column gets a vertical bar.
const ctx = canvas.getContext("2d");
ctx.clearRect(0, 0, width, height);
ctx.fillStyle = "#2979ff"; // same blue as your seek bar
for (let i = 0; i < samples; i++) {
const x = i;
// Normalize the amplitude and then scale to the canvas height.
const normalizedAmp = waveform[i] * scale;
const barHeight = normalizedAmp * height;
// Center the bar vertically.
const y = (height - barHeight) / 2;
ctx.fillRect(x, y, 1, barHeight);
}
} catch (error) {
console.error("Error decoding audio for waveform:", error);
}
}

onMount(() => {
drawWaveform();
});
</script>

<!--
New layout: We wrap the waveform canvas and seek bar in a flex-col so that
the waveform sits directly above the seek bar.
-->
<div class="rounded-[10px] border border-gray-400 bg-white p-[10px] w-[400px]">
<div class="flex items-center gap-[10px]">
<!-- Play Button -->
<div
class="w-[40px] h-[40px] rounded-full bg-[#6c7a89] flex items-center justify-center cursor-pointer"
on:click={togglePlay}
on:keydown={togglePlay}
role="button"
tabindex="0"
aria-label="Toggle playback"
>
{#if isPlaying}
<svg class="fill-white w-[20px] h-[20px]" viewBox="0 0 24 24">
<rect x="6" y="5" width="4" height="14" />
<rect x="14" y="5" width="4" height="14" />
</svg>
{:else}
<svg class="fill-white w-[20px] h-[20px]" viewBox="0 0 24 24">
<polygon points="5,3 19,12 5,21" />
</svg>
{/if}
</div>

<!-- Waveform and Seek Bar Column -->
<div class="flex flex-col flex-grow gap-1">
<!-- Waveform Canvas -->
<canvas bind:this={waveformCanvas} class="w-full h-12"></canvas>

<!-- Seek Bar -->
<div
class="h-[6px] bg-[#ddd] rounded-[3px] cursor-pointer relative"
on:click={seek}
on:keydown={seek}
role="slider"
tabindex="0"
aria-label="Seek"
aria-valuemin="0"
aria-valuemax="100"
aria-valuenow={progress}
>
<div
class="h-full bg-[#2979ff] rounded-[3px] absolute"
style="width: {progress}%"
></div>
</div>
</div>

<!-- Time Display -->
<div class="text-sm text-[#555] min-w-[50px]">
{formatTime(currentTime)} / {formatTime(duration)}
</div>

<!-- Volume Control -->
<div class="flex items-center gap-[5px]">
<svg class="w-[20px] h-[20px]" viewBox="0 0 24 24">
<path
d="M3 9v6h4l5 5V4L7 9H3zm13.5 3c0-1.77-1.02-3.29-2.5-4.03v8.07c1.48-.74 2.5-2.26 2.5-4.04z"
></path>
</svg>
<input
type="range"
min="0"
max="1"
step="0.01"
bind:value={volume}
on:input={changeVolume}
class="w-[60px]"
/>
</div>
</div>

<!-- Audio Element (hidden) -->
<!-- Note - need to propagate type from back end maybe? -->
<audio
bind:this={audio}
on:timeupdate={updateProgress}
on:ended={handleEnded}
src={"data:audio/wav;base64," + audioData}
class="hidden"
></audio>
</div>
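The `drawWaveform` function in `CustomAudio.svelte` reduces the decoded PCM samples to one averaged absolute amplitude per pixel column, then normalizes so the loudest column fills the canvas height. That downsampling step can be isolated as a small pure function — a sketch mirroring the PR's loop, not code from the diff itself (`downsampleWaveform` is a hypothetical name):

```typescript
// Downsample raw PCM samples to `columns` averaged absolute amplitudes,
// normalized so the loudest column maps to 1.0 (mirrors drawWaveform's math).
function downsampleWaveform(rawData: Float32Array, columns: number): number[] {
  const blockSize = Math.floor(rawData.length / columns);
  const waveform: number[] = new Array(columns);
  for (let i = 0; i < columns; i++) {
    let sum = 0;
    for (let j = 0; j < blockSize; j++) {
      sum += Math.abs(rawData[i * blockSize + j]);
    }
    waveform[i] = sum / blockSize;
  }
  // Guard against silent input to avoid division by zero, as the PR does.
  const maxAmp = Math.max(...waveform);
  const scale = maxAmp > 0 ? 1 / maxAmp : 1;
  return waveform.map((v) => v * scale);
}
```

Extracting the math this way would also make the waveform logic unit-testable without an `AudioContext`, which only exists in the browser.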
41 changes: 40 additions & 1 deletion client/graphpaper-inline/src/TokenGrid.svelte
@@ -1,6 +1,7 @@
<!-- Token grid that exposes each token and hover info. -->
<script lang="ts">
import {isRoleOpenerInput, isTextOutput, type NodeAttr, type RoleOpenerInput, type GenTokenExtra} from './stitch';
import {isRoleOpenerInput, isTextOutput, isAudioOutput, type NodeAttr, type RoleOpenerInput, type GenTokenExtra, isImageOutput, isVideoOutput} from './stitch';
import CustomAudio from "./CustomAudio.svelte";
import TokenGridItem from "./TokenGridItem.svelte";
import {type Token, type TokenCallback} from "./interfaces";
import {longhover} from "./longhover";
@@ -133,6 +134,9 @@
return [overlapped, noSpecialOverride];
}

let audioNode: any = null; // Store the first audio node found (hack)
let imageNode: any = null; // Store the first image node found (hack)
let videoNode: any = null; // Store the first video node found (hack)
let tokens: Array<Token> = [];
let activeOpenerRoles: Array<RoleOpenerInput> = [];
let activeCloserRoleText: Array<string> = [];
@@ -195,6 +199,18 @@
tokens.push(token);
activeOpenerRoles.pop();
}
} else if (isAudioOutput(nodeAttr)) {
if (audioNode === null) {
audioNode = nodeAttr;
}
} else if (isImageOutput(nodeAttr)) {
if (imageNode === null) {
imageNode = nodeAttr;
}
} else if (isVideoOutput(nodeAttr)) {
if (videoNode === null) {
videoNode = nodeAttr;
}
}
}
// NOTE(nopdive): Often the closer text is missing at the end of output.
@@ -515,5 +531,28 @@
</span>
{/if}
</span>

{#if audioNode !== null}
<div class="my-3">
<CustomAudio audioData={audioNode.value} />
</div>
{/if}

{#if videoNode !== null}
<div class="my-3">
<video controls>
<!-- Note - need to propagate type from back end maybe? -->
<source src={`data:video/mp4;base64,${videoNode.value}`} type="video/mp4"/>
</video>
</div>
{/if}

{#if imageNode !== null}
<div class="my-3">
<!-- Note - need to propagate type from back end maybe? -->
<img src={`data:image/png;base64,${imageNode.value}`} alt="Image output"/>
</div>
{/if}

</div>
</div>
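In all three render paths above, the media payload arrives as a bare base64 string and the MIME type is hard-coded on the client (`audio/wav`, `video/mp4`, `image/png`) — the inline "need to propagate type from back end maybe?" notes flag exactly this. A hypothetical helper (not in the PR) showing how an explicit MIME type could replace the hard-coded data-URL prefixes:

```typescript
// Hypothetical helper: build a data URL from a MIME type and base64 payload.
// The PR currently inlines "data:audio/wav;base64," etc. at each call site.
function toDataUrl(mimeType: string, base64Value: string): string {
  return `data:${mimeType};base64,${base64Value}`;
}
```

If the backend later sends a content type alongside `value`, the three templates could share this one call instead of each assuming a format.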
33 changes: 33 additions & 0 deletions client/graphpaper-inline/src/stitch.ts
@@ -31,6 +31,24 @@ export interface TextOutput extends NodeAttr {
prob: number,
}

export interface ImageOutput extends NodeAttr {
class_name: 'ImageOutput',
value: string,
is_input: boolean,
}

export interface AudioOutput extends NodeAttr {
class_name: 'AudioOutput',
value: string,
is_input: boolean,
}

export interface VideoOutput extends NodeAttr {
class_name: 'VideoOutput',
value: string,
is_input: boolean,
}

export interface RoleOpenerInput extends NodeAttr {
class_name: 'RoleOpenerInput',
name?: string,
@@ -125,6 +143,21 @@ export function isTextOutput(o: NodeAttr | undefined | null): o is TextOutput {
return o.class_name === "TextOutput";
}

export function isImageOutput(o: NodeAttr | undefined | null): o is ImageOutput {
if (o === undefined || o === null) return false;
return o.class_name === "ImageOutput";
}

export function isAudioOutput(o: NodeAttr | undefined | null): o is AudioOutput {
if (o === undefined || o === null) return false;
return o.class_name === "AudioOutput";
}

export function isVideoOutput(o: NodeAttr | undefined | null): o is VideoOutput {
if (o === undefined || o === null) return false;
return o.class_name === "VideoOutput";
}

export function isResetDisplayMessage(o: GuidanceMessage | undefined | null): o is ResetDisplayMessage {
if (o === undefined || o === null) return false;
return o.class_name === "ResetDisplayMessage";
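The new guards follow the discriminated-union pattern already used by `isTextOutput`: checking the `class_name` discriminant lets TypeScript narrow a generic `NodeAttr` to the concrete output type inside the branch. A minimal standalone sketch of that pattern, with types trimmed from `stitch.ts`:

```typescript
// Trimmed versions of the stitch.ts types.
interface NodeAttr { class_name: string }
interface AudioOutput extends NodeAttr {
  class_name: 'AudioOutput';
  value: string;     // base64 payload
  is_input: boolean;
}

// Type guard: the `o is AudioOutput` return type drives narrowing.
function isAudioOutput(o: NodeAttr | undefined | null): o is AudioOutput {
  if (o === undefined || o === null) return false;
  return o.class_name === "AudioOutput";
}

const audio: AudioOutput = { class_name: 'AudioOutput', value: 'UklGRg==', is_input: false };
const attrs: NodeAttr[] = [audio, { class_name: 'TextOutput' }];
for (const attr of attrs) {
  if (isAudioOutput(attr)) {
    // Inside this branch `attr` is typed as AudioOutput, so `.value` is accessible.
    console.log(attr.value.length);
  }
}
```

This is why `App.svelte` and `TokenGrid.svelte` can route `msg.node_attr` with plain `if/else if` chains and still access type-specific fields like `audioNode.value` safely.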