Speech methods overview

Examples of speech methods — quick start

High-level methods

Use these methods on your avatar instance:

  • say(input, opts?, onSubtitles?) — smart router; picks the right path based on input.
  • speakWithWords(input, opts?, onSubtitles?) — audio + words → derives visemes via lipsync.
  • speakWithVisemes(input, opts?, onSubtitles?) — audio + visemes (already annotated).
  • animateWords(input, opts?, onSubtitles?) — words → animation only (TEMP: disabled/unstable).
  • animateVisemes(input, opts?, onSubtitles?) — visemes → animation only (TEMP: disabled/unstable).
  • speakText(textOrSsml, opts?, onSubtitles?) — TTS path (text/SSML → audio → playback).
  • speakAudio(input, opts?, onSubtitles?) — play audio only (markers/extraAnim optional).

Common opts

{
  // timing / normalization
  trimStartMs?: number,
  trimEndMs?: number,
  trailingBreakMs?: number,
  animTailMs?: number,
  applyTrimShift?: boolean,

  // lipsync (separate from TTS voice/language)
  lipsyncLang?: string,            // e.g. 'en', 'es', 'de' (currently only English supported)

  // gestures
  enableLexicalGestures?: boolean  // default: true; set false to disable lexical gestures
}

lipsyncLang only affects lipsync/viseme derivation. It is not controlled by TTS options and must be set via opts (or at avatar init).

onSubtitles(payload) — if provided and words exist, the SDK will inject a per-word subtitles clip and attach the callback to the resulting items.
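
For example, a single say(...) call can combine the timing opts above with an onSubtitles callback. A minimal sketch, reusing the sample audio and word timings from the flows below; the trim values are illustrative, and the payload is only logged because its exact shape depends on your build:

await avatar.say({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  words: {
    tokens: ['Hi','there','buddy'],
    starts: [ 445, 640, 821 ],
    durations: [ 215, 201, 648 ]
  }
}, {
  trimStartMs: 50,        // timing/normalization opts from the list above (illustrative values)
  trimEndMs: 100,
  applyTrimShift: true,
  lipsyncLang: 'en'
}, (payload) => {
  console.log('subtitle payload:', payload); // inspect the shape in your build before rendering
});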


Input shapes

Audio (same everywhere)

audio: AudioBuffer | ArrayBuffer | ArrayBuffer[],
sampleRate?: number, // only for PCM in ArrayBuffer
timeUnit?: 'ms' | 's' // top-level default for nested sections (default 'ms')
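
Since audio accepts either raw bytes or a decoded AudioBuffer, you can pass fetched data directly or decode it first with the Web Audio API. A minimal sketch (the file path is a placeholder):

// Option A: pass the raw ArrayBuffer of an encoded file (e.g. MP3/WAV) directly.
const bytes = await (await fetch('/sample.mp3')).arrayBuffer();
await avatar.speakAudio({ audio: bytes });

// Option B: decode to an AudioBuffer first via the Web Audio API.
const ctx = new AudioContext();
const encoded = await (await fetch('/sample.mp3')).arrayBuffer();
const decoded = await ctx.decodeAudioData(encoded); // decodeAudioData detaches `encoded`
await avatar.speakAudio({ audio: decoded });

// For raw PCM in an ArrayBuffer, also provide sampleRate, e.g. { audio: pcmBytes, sampleRate: 16000 }.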

Words

words: {
  tokens:    string[],     // e.g. ["Hello","world"]
  starts:    number[],     // word START times
  durations: number[],     // explicit durations
  timeUnit?: 'ms' | 's'
}
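
If your alignment tool reports seconds, you can pass them unchanged by setting timeUnit: 's'. A minimal sketch using the same timings as flow 1 below, expressed in seconds:

await avatar.speakWithWords({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  words: {
    tokens:    ['Hi','there','buddy'],
    starts:    [0.445, 0.640, 0.821],   // seconds
    durations: [0.215, 0.201, 0.648],
    timeUnit:  's'
  }
}, { lipsyncLang: 'en' });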

Visemes

visemes: {
  labels:    string[],     // e.g. ["AA","PP","FF",...]
  starts:    number[],     // viseme START times
  durations: number[],     // explicit durations
  timeUnit?: 'ms' | 's'
}

Markers (optional)

markers: {
  labels: (string|function)[], // e.g. ["blink", () => console.log("beat")]
  times:  number[],
  timeUnit?: 'ms' | 's'
}
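
Marker labels can be plain strings or functions that fire when playback reaches the given time. A minimal sketch, assuming the same sample audio as the flows below; the marker times are arbitrary:

await avatar.speakAudio({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  markers: {
    labels: ['blink', () => console.log('beat')],
    times:  [300, 1000],
    timeUnit: 'ms'
  }
});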

Extra animation (optional)

Note: extraAnim is advanced and may be ignored by the default renderer.

extraAnim: Clip[], // passed through as-is
anim: any,         // full animation object (if your factory supports it)
mode?: 'auto' | 'audio' | 'anim' // usually leave as-is; animate* forces 'anim'

Typical flows

1. Audio + words (visemes are derived automatically)

await avatar.speakWithWords({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  words: {
    tokens: ['Hi','there','buddy'],
    starts: [ 445, 640, 821 ],
    durations: [ 215, 201, 648 ], // same unit as `starts`
    // timeUnit: 'ms'
  }
}, {
  lipsyncLang: 'en'
}, (payload) => {
  // render subtitles
});

2. Audio + precomputed visemes

await avatar.speakWithVisemes({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  visemes: {
    labels: ['I', 'TH', 'E', 'RR', 'PP', 'aa', 'DD', 'I'],
    starts: [496, 640, 665, 728,  821, 866, 960, 1046],
    durations: [134, 45,  83,  63, 65, 114, 106, 423],
    timeUnit: 'ms'
  }
});

3. Smart router

Here’s how the smart router (the default say(...) path) currently works:

  1. audio + visemes → speakWithVisemes
  2. audio + words → speakWithWords
  3. audio only → speakAudio
  4. words only → animateWords (temporarily unstable/disabled)
  5. visemes only → animateVisemes (temporarily unstable/disabled)
  6. text / SSML → speakText (external TTS path)

This is the out-of-the-box behavior today. In a future update, you’ll be able to override this selection logic via opts (custom routing rules / priorities).

// audio + visemes → speakWithVisemes
await avatar.say({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  visemes: { labels:[...], starts:[...], durations:[...] }
});

// audio + words → speakWithWords
await avatar.say({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  words: { tokens:[...], starts:[...], durations:[...] }
}, { lipsyncLang: 'en' });

// audio only → speakAudio
await avatar.say({ audio: await fetchAsArrayBuffer('/sample.mp3') });

// words only → animateWords (TEMP disabled/unstable)
await avatar.say({ words: { tokens:[...], starts:[...], durations:[...] } });

// visemes only → animateVisemes (TEMP disabled/unstable)
await avatar.say({ visemes: { labels:[...], starts:[...], durations:[...] } });

// text/SSML → TTS
await avatar.say({ text: 'Hello there!' }, { lipsyncLang: 'en' });

4. Audio only

Experimental: may not work in all builds.

await avatar.speakAudio({
  audio: await fetchAsArrayBuffer('/sample.mp3'),
  // markers?: { labels:[...], times:[...], timeUnit:'ms' }
});

5. Text → TTS

await avatar.speakText('Hello there!', {
  lipsyncLang: 'en',
  enableLexicalGestures: false // disable gestures for this call
}, (payload) => {
  // render subtitles
});
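
speakText also accepts SSML (as noted in the method list); which tags are honored depends on your TTS adapter. A minimal sketch with standard SSML markup:

// SSML input; tag support depends on the configured TTS adapter.
await avatar.speakText(
  '<speak>Hello there! <break time="500ms"/> Nice to meet you.</speak>',
  { lipsyncLang: 'en' }
);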

See the TTS adapters guide for more speakText examples.


Minimal browser demo

async function fetchAsArrayBuffer(url) {
  const r = await fetch(url);
  return await r.arrayBuffer();
}

const url = 'https://example.com/link_to_audio_file';

const words = ['Hi','there','buddy'];
const wstarts = [ 445, 640, 821 ];
const wdur = [ 215, 201, 648 ];

const vis = ['I', 'TH', 'E', 'RR', 'PP', 'aa', 'DD', 'I'];
const vstarts = [496, 640, 665, 728,  821, 866, 960, 1046];
const vdur = [134, 45,  83,  63, 65, 114, 106, 423];

document.getElementById('playAudio').onclick = async () => {
  const ab = await fetchAsArrayBuffer(url);
  await avatar.speakAudio({ audio: ab });
};

document.getElementById('playWords').onclick = async () => {
  const ab = await fetchAsArrayBuffer(url);
  await avatar.speakWithWords({
    audio: ab,
    words: { tokens: words, starts: wstarts, durations: wdur }
  }, { lipsyncLang: 'en' }, (payload) => {
    // onSubtitles callback
  });
};

document.getElementById('playVisemes').onclick = async () => {
  const ab = await fetchAsArrayBuffer(url);
  await avatar.speakWithVisemes({
    audio: ab,
    visemes: { labels: vis, starts: vstarts, durations: vdur }
  });
};

document.getElementById('animateVisemes').onclick = async () => {
  await avatar.animateVisemes({
    visemes: { labels: vis, starts: vstarts, durations: vdur }
  });
};

document.getElementById('animateWords').onclick = async () => {
  await avatar.animateWords({
    words: { tokens: words, starts: wstarts, durations: wdur }
  }, { lipsyncLang: 'en' });
};

Tips & gotchas

  • Time units: default is 'ms'. If you provide seconds, set timeUnit: 's' in the specific section (words, visemes, markers).
  • Zero/invalid durations: clips with duration <= 0 are skipped in the current implementation.
  • Subtitles: the onSubtitles callback fires only when words are present.
  • Modes: you rarely need to touch mode. The animate* methods force 'anim'; otherwise the default 'auto' is recommended.
  • TTS vs. Lipsync: changing TTS adapter voice/language does not change lipsyncLang. Set lipsyncLang in opts or at avatar init.
  • Lexical gestures & tokens: gesture mapping is token-based; if your tokens include trailing spaces/punctuation, trim/normalize them or provide clean tokens to ensure matches (see the sketch after this list).
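
For the last point, a small cleanup pass before building the words section is usually enough. A minimal sketch, independent of the SDK:

// Strip surrounding whitespace and punctuation so tokens match the gesture map.
function cleanTokens(tokens) {
  return tokens.map((t) => t.trim().replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ''));
}

cleanTokens([' Hello,', 'world! ']); // → ['Hello', 'world']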

Future plans

  • Gesture presets: ready-to-use maps (casual / formal / energetic).
  • Persona styles: different gesture density/tempo for character moods.
  • Hooks: custom mapping rules and user-defined gestures.