TTS Adapters Guide

This tutorial explains how to plug any TTS provider into AvaCapo Talking using small adapter classes. All built-in adapters are exposed under the AvaCapoSDK.ttsAdapters namespace.


Quick start

ESM

import AvaCapoSDK from 'avacapo-sdk';

const { RestTTSAdapter, GoogleTTSAdapter } = AvaCapoSDK.ttsAdapters;

const tts = new RestTTSAdapter({
  endpoint: 'https://your-tts.example.com/v1/synthesize' // your TTS HTTP endpoint
});

const sdk = await AvaCapoSDK.init({ apiKey: 'YOUR_KEY', container: '#app', preload: ['talking'] });
const avatar = await sdk.talkingAvatar(null, {
  ttsAdapter: tts,
  ttsOpts: {
    voice: 'af_bella',
    timeUnit: 'ms',
    adapterConfig: {
      language: 'en-us',
      audioEncoding: 'wav',
      // Best practice: fetch a short-lived token from your backend:
      jwtGet: async () => (await (await fetch('/api/jwt')).json()).token
    }
  }
});

await avatar.speakText('Hello there!');

UMD (<script> tag)

<script>
  const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;
  const tts = new RestTTSAdapter({ endpoint: 'https://your-tts.example.com/v1/synthesize' });

  const sdk = new AvaCapoSDK({ apiKey: 'YOUR_KEY', container: '#avatar', preload: ['talking'] });
  sdk.init().then(async () => {
    const avatar = await sdk.talkingAvatar(null, { ttsAdapter: tts, ttsOpts: { voice: 'af_bella' }});
    avatar.speakText('Hi!');
  });
</script>

Built-in adapters (classes)

All adapters support:

  • JWT via adapterConfig.jwtGet() → adds Authorization: Bearer <token> (override with jwtHeaderName, jwtPrefix)
  • API keys via adapterConfig.apiKey → appended as a ?key=... query parameter (rename the parameter with apiKeyParam)
  • Abort via AbortSignal
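For example, if your provider authenticates with an API key in the query string, the relevant piece of ttsOpts might look like this (a sketch; the key value is a placeholder):

ttsOpts: {
  adapterConfig: {
    apiKey: 'YOUR_PROVIDER_KEY', // appended to the request URL as ?key=...
    apiKeyParam: 'key'           // query-parameter name; 'key' is the default
  }
}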

new RestTTSAdapter(defaults)

Flexible HTTP/REST adapter (JSON or raw audio/*). Zero boilerplate for typical services.

Default request (overridable): POST JSON { input, voice, language, speed, audioEncoding }

Default response (accepted shapes):

  • JSON with audioContent (base64), optional words/wtimes/wdurations or visemes/vtimes/vdurations
  • timepoints: [{ markName, timeSeconds }]
  • raw audio/* (binary body)

const rest = new RestTTSAdapter({
  endpoint: 'https://your-tts.example.com/v1/synthesize',
  audioEncoding: 'mp3',
  language: 'en-us'
});
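For reference, a JSON response body that the default parser accepts could look like this (illustrative values, shown as a JS object for readability):

{
  audioContent: '<base64-encoded audio>', // required
  words: ['Hello', 'there'],              // optional word tokens
  wtimes: [0, 420],                       // word start times
  wdurations: [400, 380]                  // word durations (derived if omitted)
}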

Override mapping if your API differs:

const rest = new RestTTSAdapter({
  endpoint: 'https://api.example.com/tts',
  buildRequest(input, cfg) {
    const headers = { 'Content-Type': 'application/json', ...(cfg.headers || {}) };
    const body = JSON.stringify({
      q: input.text, voiceId: input.voice || cfg.voice,
      speed: input.rate ?? 1, lang: cfg.language || 'en-us',
      format: cfg.audioEncoding || 'mp3'
    });
    return { url: cfg.endpoint, init: { method: 'POST', headers, body, signal: input.signal } };
  },
  async parseResponse(res, input) {
    const data  = await res.json();
    const audio = Uint8Array.from(atob(data.audioB64), c => c.charCodeAt(0)).buffer;
    return { audio, timeUnit: input.timeUnit || 'ms', words: data.words, wtimes: data.stamps };
  }
});

new GoogleTTSAdapter(defaults)

Google Cloud Text-to-Speech (JSON, base64 audioContent).

const google = new GoogleTTSAdapter({
  endpoint: 'https://texttospeech.googleapis.com/v1/text:synthesize',
  audioEncoding: 'MP3'
});
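Wiring it into an avatar works the same way as with any other adapter. A sketch that authenticates with an API key passed through adapterConfig (supply the key from your backend rather than hard-coding it):

const avatar = await sdk.talkingAvatar(null, {
  ttsAdapter: google,
  ttsOpts: {
    adapterConfig: {
      apiKey: 'YOUR_GOOGLE_API_KEY' // sent as a ?key=... query parameter
    }
  }
});

await avatar.speakText('Hello from Google TTS!');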

Contracts (what adapters receive/return)

TTSAdapterInput

Note: this contract is provisional and will be extended in the future to be more flexible.

{
  text: string;                 // plain text (no SSML required here)
  ssml?: string[];              // optional SSML representation of the input text; most TTS engines prefer it over plain text
  words?: string[];             // optional: tokens, helpful if provider returns marks only
  voice?: string | object;      // provider voice id or object
  rate?: number; pitch?: number; volume?: number;
  timeUnit?: 'ms' | 's';        // default 'ms'
  adapterConfig?: any;          // merged into adapter defaults (endpoint, headers, jwtGet, etc.)
  signal?: AbortSignal;         // cancellation
}
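For reference, a typical object handed to your adapter's synthesize() might look like this (illustrative values only):

{
  text: 'Hello there!',
  voice: 'af_bella',
  rate: 1,
  timeUnit: 'ms',
  adapterConfig: { language: 'en-us', audioEncoding: 'wav' },
  signal: abortController.signal // AbortSignal managed by the SDK
}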

TTSAdapterOutput

{
  audio: ArrayBuffer;           // required (encoded or raw; SDK decodes)
  timeUnit?: 'ms' | 's';        // default 'ms' (check your TTS engine documentation)

  // optional — word timings
  words?: string[];
  wtimes?: number[];            // starts
  wduration?: number[];         // durations (auto-derived if omitted)

  // optional — viseme timings
  visemes?: string[];
  vtimes?: number[];
  vduration?: number[];
}

Notes:

  • If you only return audio, the pipeline still animates (envelope fallback).
  • If the provider returns only timepoints (marks), map them to wtimes; durations will be derived unless you provide wduration (see the sketch after this list).
  • Units default to milliseconds; you can return 's' and the SDK will normalize.
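A minimal sketch of that timepoints-to-wtimes mapping, assuming the provider reports timeSeconds and the caller supplied the token list in input.words (data and audio are assumed to have been parsed already):

// e.g. inside a custom parseResponse(res, input)
const wtimes = data.timepoints.map(tp => tp.timeSeconds * 1000); // seconds → ms
return {
  audio,
  timeUnit: 'ms',
  words: input.words, // tokens provided by the caller
  wtimes              // durations are derived automatically when omitted
};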

Creating your own adapter (full control)

Extend the base class when your provider is not a simple REST JSON API (SDK/WebSocket/streaming).

ESM (recommended)

import AvaCapoSDK from 'avacapo-sdk';

// Pull adapter base + errors from the public SDK namespace
const { BaseTTSAdapter, TTSSynthesizeError } = AvaCapoSDK.ttsAdapters;

export class MyWsAdapter extends BaseTTSAdapter {
  constructor() {
    super({ language: 'en-us', audioEncoding: 'wav' }); // adapter defaults (optional)
  }

  async synthesize(input) {
    const cfg = this._mergeConfig(input.adapterConfig);
    const { text, signal } = input;

    // Your transport: WebSocket/SDK/etc.
    const { audio, words, wtimes } = await myWsClient.synthesize({ text, cfg, signal });
    if (!audio) throw new TTSSynthesizeError('Empty audio');

    // Normalize (adds timeUnit, derives missing durations, trims arrays)
    return this._normalizeOutput(
      { audio, timeUnit: input.timeUnit || 'ms', words, wtimes },
      input
    );
  }
}

UMD (<script> tag)

<script>
  const { BaseTTSAdapter, TTSSynthesizeError } = AvaCapoSDK.ttsAdapters;

  class MyWsAdapter extends BaseTTSAdapter {
    constructor() {
      super({ language: 'en-us', audioEncoding: 'wav' });
    }

    async synthesize(input) {
      const cfg = this._mergeConfig(input.adapterConfig);
      const { text, signal } = input;

      // Your transport: WebSocket/SDK/etc.
      const { audio, words, wtimes } = await myWsClient.synthesize({ text, cfg, signal });
      if (!audio) throw new TTSSynthesizeError('Empty audio');

      return this._normalizeOutput(
        { audio, timeUnit: input.timeUnit || 'ms', words, wtimes },
        input
      );
    }
  }
</script>

Use it:

const mine = new MyWsAdapter();
const avatar = await sdk.talkingAvatar(null, { ttsAdapter: mine });
await avatar.speakText('Custom transport works!');

Authentication (best practice)

Do not ship secrets in client code. Provide a short-lived token from your backend and fetch it at runtime:

ttsOpts: {
  adapterConfig: {
    jwtGet: async () => (await (await fetch('/api/jwt')).json()).token,
    // jwtHeaderName: 'Authorization', jwtPrefix: 'Bearer ', // defaults
    // apiKey: '...', apiKeyParam: 'key'
  }
}

RestTTSAdapter automatically adds the Authorization header and/or ?key=... query param.
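What serves /api/jwt is up to your backend. A minimal sketch using Express and the jsonwebtoken package (both are assumptions here, not SDK requirements):

// server.js — mints the short-lived tokens that jwtGet() fetches
import express from 'express';
import jwt from 'jsonwebtoken';

const app = express();

app.get('/api/jwt', (req, res) => {
  // Authenticate the user/session before minting a token
  const token = jwt.sign({ sub: 'user-123' }, process.env.TTS_JWT_SECRET, { expiresIn: '60s' });
  res.json({ token }); // shape expected by the jwtGet() example above
});

app.listen(3000);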


Cancellation

Adapters receive an AbortSignal as input.signal. In RestTTSAdapter this is wired automatically; in custom adapters, forward it to your fetch/SDK/WebSocket.
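In a fetch-based custom adapter, forwarding the signal is a one-liner (a sketch; the endpoint and body shape are placeholders):

async synthesize(input) {
  const cfg = this._mergeConfig(input.adapterConfig);
  const res = await fetch(cfg.endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ input: input.text }),
    signal: input.signal // aborting the speech aborts the in-flight request
  });
  return this._normalizeOutput({ audio: await res.arrayBuffer(), timeUnit: 'ms' }, input);
}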

Runtime API:

avatar.abortActiveSpeech(); // stops current synthesis + playback

Local HeadTTS example (via Rest adapter)

const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;

const local = new RestTTSAdapter({
  endpoint: 'https://your-tts.example.com/v1/synthesize' // replace with your local HeadTTS server's synthesize URL
});

const avatar = await sdk.talkingAvatar(null, {
  ttsAdapter: local,
  ttsOpts: {
    voice: 'af_bella',
    timeUnit: 'ms',
    adapterConfig: {
      language: 'en-us',
      audioEncoding: 'wav'
    }
  }
});

avatar.speakText('Hi from local TTS!');

Troubleshooting

  • No audio: ensure you return an ArrayBuffer (convert base64 to bytes first).
  • Timings off: verify timeUnit ('ms' vs 's').
  • 401/403: check jwtGet() and your backend/CORS config.
  • Abort doesn’t work: make sure you pass input.signal into your request logic.

Summary

  1. Pick a built-in adapter from AvaCapoSDK.ttsAdapters or write your own by extending BaseTTSAdapter.
  2. Return { audio, ...optional timings } in TTSAdapterOutput.
  3. Hand it to TalkingAvatar via ttsAdapter and call speakText().

That’s it – the SDK takes care of decoding, queueing, and lip-sync animation.