Examples of speech methods — quick start
High-level methods
Use these methods on your avatar instance:
- say(input, opts?, onSubtitles?) — smart router; picks the right path based on input.
- speakWithWords(input, opts?, onSubtitles?) — audio + words → derives visemes via lipsync.
- speakWithVisemes(input, opts?, onSubtitles?) — audio + visemes (already annotated).
- animateWords(input, opts?, onSubtitles?) — words → animation only (TEMP: disabled/unstable).
- animateVisemes(input, opts?, onSubtitles?) — visemes → animation only (TEMP: disabled/unstable).
- speakText(textOrSsml, opts?, onSubtitles?) — TTS path (text/SSML → audio → playback).
- speakAudio(input, opts?, onSubtitles?) — play audio only (markers/extraAnim optional).
Common opts
{
// timing / normalization
trimStartMs?: number,
trimEndMs?: number,
trailingBreakMs?: number,
animTailMs?: number,
applyTrimShift?: boolean,
// lipsync (separate from TTS voice/language)
lipsyncLang?: string, // e.g. 'en', 'es', 'de' (currently only 'en' is supported)
// gestures
enableLexicalGestures?: boolean // default: true; set false to disable lexical gestures
}
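A hedged usage sketch combining a few of these options; the behavior notes in the comments are inferred from the option names, not documented guarantees:

await avatar.speakWithWords({
  audio: await (await fetch('/sample.mp3')).arrayBuffer(),
  words: { tokens: ['Hi'], starts: [0], durations: [500] }
}, {
  trimStartMs: 120,             // assumption: drops this much from the audio head
  trailingBreakMs: 250,         // assumption: holds a short pause after playback
  lipsyncLang: 'en',            // lipsync derivation language (English only for now)
  enableLexicalGestures: false  // skip lexical gestures for this call
});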
lipsyncLang only affects lipsync/viseme derivation. It is not controlled by TTS options and must be set via opts (or at avatar init).
onSubtitles(payload) — if provided and words exist, the SDK will inject a per-word subtitles clip and attach the callback to the resulting items.
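A minimal handler sketch. The payload shape is not specified in this section, so the field access below is an assumption; log the payload once and adjust:

const subtitlesEl = document.getElementById('subtitles'); // hypothetical element

function renderSubtitles(payload) {
  // `word`/`text` are guessed field names; inspect the real payload shape
  subtitlesEl.textContent = String(payload?.word ?? payload?.text ?? '');
}

await avatar.speakWithWords(input, { lipsyncLang: 'en' }, renderSubtitles); // `input` as in flow 1 below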
Input shapes
Audio (same everywhere)
audio: AudioBuffer | ArrayBuffer | ArrayBuffer[],
sampleRate?: number, // only for PCM in ArrayBuffer
timeUnit?: 'ms' | 's' // top-level default for nested sections (default 'ms')
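Both audio forms in practice, assuming the shape above: a decoded AudioBuffer needs no sampleRate, while raw PCM in an ArrayBuffer does:

// Compressed file → decoded AudioBuffer (sampleRate not needed)
const ctx = new AudioContext();
const buf = await ctx.decodeAudioData(await (await fetch('/sample.mp3')).arrayBuffer());
await avatar.speakAudio({ audio: buf });

// Raw PCM in an ArrayBuffer → sampleRate is required
await avatar.speakAudio({ audio: pcmChunk, sampleRate: 16000 }); // pcmChunk: your PCM ArrayBuffer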
Words
words: {
tokens: string[], // e.g. ["Hello","world"]
starts: number[], // word START times
durations: number[], // explicit durations
timeUnit?: 'ms' | 's'
}
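If you only have start times, durations can be derived by letting each word run until the next one starts; a hypothetical helper (not part of the SDK):

function wordsFromStarts(tokens, starts, totalMs) {
  // each word lasts until the next word starts; the last runs to totalMs
  const durations = starts.map((s, i) =>
    (i + 1 < starts.length ? starts[i + 1] : totalMs) - s);
  return { tokens, starts, durations, timeUnit: 'ms' };
}

// wordsFromStarts(['Hi','there','buddy'], [445, 640, 821], 1469)
// → { ..., durations: [195, 181, 648], timeUnit: 'ms' }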
Visemes
visemes: {
labels: string[], // e.g. ["AA","PP","FF",...]
starts: number[], // viseme START times
durations: number[], // explicit durations
timeUnit?: 'ms' | 's'
}
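labels, starts, and durations are parallel arrays (one start and one duration per label); a hypothetical sanity check before handing them to the SDK:

function assertAligned({ labels, starts, durations }) {
  if (starts.length !== labels.length || durations.length !== labels.length) {
    throw new Error('visemes: labels/starts/durations must have equal length');
  }
}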
Markers (optional)
markers: {
labels: (string|function)[], // e.g. ["blink", () => console.log("beat")]
times: number[],
timeUnit?: 'ms' | 's'
}
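Markers do not appear in a full call elsewhere in this section; a sketch attaching them to speakAudio, which accepts optional markers per the method list above (what a string label triggers is renderer-specific):

await avatar.speakAudio({
  audio: await (await fetch('/sample.mp3')).arrayBuffer(),
  markers: {
    labels: ['blink', () => console.log('beat')], // named action or callback
    times: [500, 1200],                           // fire points on the timeline
    timeUnit: 'ms'
  }
});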
Extra animation (optional)
Note: extraAnim is advanced and may be ignored by the default renderer.
extraAnim: Clip[], // passed through as-is
anim: any, // full animation object (if your factory supports it)
mode?: 'auto' | 'audio' | 'anim' // usually leave as-is; animate* forces 'anim'
Typical flows
1. Audio + words (visemes are derived automatically)
await avatar.speakWithWords({
audio: await fetchAsArrayBuffer('/sample.mp3'),
words: {
tokens: ['Hi','there','buddy'],
starts: [ 445, 640, 821 ],
durations: [ 215, 201, 648 ], // same unit as `starts`
// timeUnit: 'ms'
}
}, {
lipsyncLang: 'en'
}, (payload) => {
// render subtitles
});
2. Audio + precomputed visemes
await avatar.speakWithVisemes({
audio: await fetchAsArrayBuffer('/sample.mp3'),
visemes: {
labels: ['I', 'TH', 'E', 'RR', 'PP', 'aa', 'DD', 'I'],
starts: [496, 640, 665, 728, 821, 866, 960, 1046],
durations: [134, 45, 83, 63, 65, 114, 106, 423],
timeUnit: 'ms'
}
});
3. Smart router
Here’s how the smart router (the default say(...) path) currently works:
- audio + visemes → speakWithVisemes
- audio + words → speakWithWords
- audio only → speakAudio
- words only → animateWords (temporarily unstable/disabled)
- visemes only → animateVisemes (temporarily unstable/disabled)
- text / SSML → speakText (external TTS path)
This is the out-of-the-box behavior today. In a future update, you’ll be able to override this selection logic via opts (custom routing rules / priorities).
// audio + visemes → speakWithVisemes
await avatar.say({
audio: await fetchAsArrayBuffer('/sample.mp3'),
visemes: { labels:[...], starts:[...], durations:[...] }
});
// audio + words → speakWithWords
await avatar.say({
audio: await fetchAsArrayBuffer('/sample.mp3'),
words: { tokens:[...], starts:[...], durations:[...] }
}, { lipsyncLang: 'en' });
// audio only → speakAudio
await avatar.say({ audio: await fetchAsArrayBuffer('/sample.mp3') });
// words only → animateWords (TEMP disabled/unstable)
await avatar.say({ words: { tokens:[...], starts:[...], durations:[...] } });
// visemes only → animateVisemes (TEMP disabled/unstable)
await avatar.say({ visemes: { labels:[...], starts:[...], durations:[...] } });
// text/SSML → TTS
await avatar.say({ text: 'Hello there!' }, { lipsyncLang: 'en' });
4. Audio only
Experimental: may not work in all builds.
await avatar.speakAudio({
audio: await fetchAsArrayBuffer('/sample.mp3'),
// markers?: { labels:[...], times:[...], timeUnit:'ms' }
});
5. Text → TTS
await avatar.speakText('Hello there!', {
lipsyncLang: 'en',
enableLexicalGestures: false // disable gestures for this call
}, (payload) => {
// render subtitles
});
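speakText also accepts SSML (the method list shows textOrSsml); how much of the markup is honored depends on your TTS adapter:

await avatar.speakText(
  '<speak>Hello <break time="300ms"/> there!</speak>',
  { lipsyncLang: 'en' }
);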
See the TTS adapters guide for more speakText examples.
Minimal browser demo
async function fetchAsArrayBuffer(url) {
const r = await fetch(url);
return await r.arrayBuffer();
}
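// The demo assumes `avatar` is an already-initialized instance from your SDK
// setup, and that buttons with the IDs used below exist in the page.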
const url = 'https://example.com/link_to_audio_file';
const words = ['Hi','there','buddy'];
const wstarts = [ 445, 640, 821 ];
const wdur = [ 215, 201, 648 ];
const vis = ['I', 'TH', 'E', 'RR', 'PP', 'aa', 'DD', 'I'];
const vstarts = [496, 640, 665, 728, 821, 866, 960, 1046];
const vdur = [134, 45, 83, 63, 65, 114, 106, 423];
document.getElementById('playAudio').onclick = async () => {
const ab = await fetchAsArrayBuffer(url);
await avatar.speakAudio({ audio: ab });
};
document.getElementById('playWords').onclick = async () => {
const ab = await fetchAsArrayBuffer(url);
await avatar.speakWithWords({
audio: ab,
words: { tokens: words, starts: wstarts, durations: wdur }
}, { lipsyncLang: 'en' }, (payload) => {
// onSubtitles callback
});
};
document.getElementById('playVisemes').onclick = async () => {
const ab = await fetchAsArrayBuffer(url);
await avatar.speakWithVisemes({
audio: ab,
visemes: { labels: vis, starts: vstarts, durations: vdur }
});
};
document.getElementById('animateVisemes').onclick = async () => {
await avatar.animateVisemes({
visemes: { labels: vis, starts: vstarts, durations: vdur }
});
};
document.getElementById('animateWords').onclick = async () => {
await avatar.animateWords({
words: { tokens: words, starts: wstarts, durations: wdur }
}, { lipsyncLang: 'en' });
};
Tips & gotchas
- Time units: default is 'ms'. If you provide seconds, set timeUnit: 's' in the specific section (words, visemes, markers).
- Zero/invalid durations: clips with duration <= 0 are skipped in the current implementation.
- Subtitles: the callback is effective only when words are present.
- Modes: you rarely need to touch mode. animate* forces 'anim'; otherwise the default 'auto' is recommended.
- TTS vs. lipsync: changing the TTS adapter voice/language does not change lipsyncLang. Set lipsyncLang in opts or at avatar init.
- Lexical gestures & tokens: gesture mapping is token-based; if your tokens include trailing spaces/punctuation, trim/normalize them or provide clean tokens to ensure matches.
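For that last tip, a hypothetical token normalizer (not part of the SDK):

const normalizeTokens = (tokens) =>
  tokens.map((t) => t.trim().replace(/[.,!?;:]+$/, ''));

// normalizeTokens(['Hi,', 'there ', 'buddy!']) → ['Hi', 'there', 'buddy']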
Future plans
- Gesture presets: ready-to-use maps (casual / formal / energetic).
- Persona styles: different gesture density/tempo for character moods.
- Hooks: custom mapping rules and user-defined gestures.