TTS Adapters Guide
This tutorial explains how to plug any TTS provider into AvaCapo Talking using small adapter classes. All built-in adapters are exposed under the AvaCapoSDK.ttsAdapters namespace.
Quick start
ESM
import AvaCapoSDK from 'avacapo-sdk';
const { RestTTSAdapter, GoogleTTSAdapter } = AvaCapoSDK.ttsAdapters;
const tts = new RestTTSAdapter({
endpoint: 'https://your-tts.example.com/v1/synthesize' // your TTS HTTP endpoint
});
const sdk = await AvaCapoSDK.init({ apiKey: 'YOUR_KEY', container: '#app', preload: ['talking'] });
const avatar = await sdk.talkingAvatar(null, {
ttsAdapter: tts,
ttsOpts: {
voice: 'af_bella',
timeUnit: 'ms',
adapterConfig: {
language: 'en-us',
audioEncoding: 'wav',
// Best practice: fetch a short-lived token from your backend:
jwtGet: async () => (await (await fetch('/api/jwt')).json()).token
}
}
});
await avatar.speakText('Hello there!');
UMD (<script> tag)
<script>
const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;
const tts = new RestTTSAdapter({ endpoint: 'https://your-tts.example.com/v1/synthesize' });
const sdk = new AvaCapoSDK({ apiKey: 'YOUR_KEY', container: '#avatar', preload: ['talking'] });
sdk.init().then(async () => {
const avatar = await sdk.talkingAvatar(null, { ttsAdapter: tts, ttsOpts: { voice: 'af_bella' }});
avatar.speakText('Hi!');
});
</script>
Built-in adapters (classes)
All adapters support:
- JWT via adapterConfig.jwtGet() → adds Authorization: Bearer <token> (override with jwtHeaderName, jwtPrefix)
- API keys via adapterConfig.apiKey → ?key=... (name override: apiKeyParam); both auth options are combined in the sketch below
- Abort via AbortSignal
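A minimal sketch combining both auth options via adapterConfig (the endpoint, token route, and key value are placeholders; see Authentication below for details):
const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;
const tts = new RestTTSAdapter({ endpoint: 'https://your-tts.example.com/v1/synthesize' });
const avatar = await sdk.talkingAvatar(null, {
  ttsAdapter: tts,
  ttsOpts: {
    adapterConfig: {
      // JWT fetched at runtime → sent as "Authorization: Bearer <token>" by default
      jwtGet: async () => (await (await fetch('/api/jwt')).json()).token,
      // API key → appended as ?key=... (rename the query param via apiKeyParam)
      apiKey: 'YOUR_PROVIDER_KEY'
    }
  }
});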
new RestTTSAdapter(defaults)
Flexible HTTP/REST adapter (JSON or raw audio/*). Zero boilerplate for typical services.
Default request (overridable): POST JSON { input, voice, language, speed, audioEncoding }
Default response (accepted shapes):
- JSON with audioContent (base64), optional words/wtimes/wdurations or visemes/vtimes/vdurations, or timepoints: [{ markName, timeSeconds }] (example after the snippet below)
- raw audio/* (binary body)
const rest = new RestTTSAdapter({
endpoint: 'https://your-tts.example.com/v1/synthesize',
audioEncoding: 'mp3',
language: 'en-us'
});
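For illustration, a JSON response along these lines matches the first accepted shape and should be parsed without a custom parseResponse (all values here are made up):
// Example provider response body (illustrative values only):
const exampleResponse = {
  audioContent: '<base64-encoded audio>',   // required for the JSON shape
  words:  ['Hello', 'there'],               // optional word tokens
  wtimes: [0, 420],                         // optional start times (default unit: ms)
  wdurations: [400, 380]                    // optional durations (derived if omitted)
};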
Override mapping if your API differs:
const rest = new RestTTSAdapter({
endpoint: 'https://api.example.com/tts',
buildRequest(input, cfg) {
const headers = { 'Content-Type': 'application/json', ...(cfg.headers || {}) };
const body = JSON.stringify({
q: input.text, voiceId: input.voice || cfg.voice,
speed: input.rate ?? 1, lang: cfg.language || 'en-us',
format: cfg.audioEncoding || 'mp3'
});
return { url: cfg.endpoint, init: { method: 'POST', headers, body, signal: input.signal } };
},
async parseResponse(res, input) {
const data = await res.json();
const audio = Uint8Array.from(atob(data.audioB64), c => c.charCodeAt(0)).buffer;
return { audio, timeUnit: input.timeUnit || 'ms', words: data.words, wtimes: data.stamps };
}
});
new GoogleTTSAdapter(defaults)
Google Cloud Text-to-Speech (JSON, base64 audioContent).
const google = new GoogleTTSAdapter({
endpoint: 'https://texttospeech.googleapis.com/v1/text:synthesize',
audioEncoding: 'MP3'
});
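One possible way to wire it up, assuming your backend mints a short-lived Google access token at /api/google-token (the route and voice name are assumptions, not part of the SDK):
const avatar = await sdk.talkingAvatar(null, {
  ttsAdapter: google,
  ttsOpts: {
    voice: 'en-US-Neural2-C', // assumption: a Google voice name; the contract also allows a voice object
    adapterConfig: {
      // Best practice (see Authentication below): fetch a short-lived token at runtime
      jwtGet: async () => (await (await fetch('/api/google-token')).json()).token
    }
  }
});
await avatar.speakText('Hello from Google TTS!');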
Contracts (what adapters receive/return)
TTSAdapterInput
This contract is provisional; it will be extended in the future to be more flexible.
{
text: string; // plain text (no SSML required here)
ssml?: string[]; // optional ssml representation of input text. Most TTS engines prefer it over plain text
words?: string[]; // optional: tokens, helpful if provider returns marks only
voice?: string | object; // provider voice id or object
rate?: number; pitch?: number; volume?: number;
timeUnit?: 'ms' | 's'; // default 'ms'
adapterConfig?: any; // merged into adapter defaults (endpoint, headers, jwtGet, etc.)
signal?: AbortSignal; // cancellation
}
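To make the shape concrete, this is the kind of object the SDK might pass to synthesize() (illustrative values only; the actual content depends on your ttsOpts and speakText() call):
const input = {
  text: 'Hello there!',
  voice: 'af_bella',
  rate: 1,
  timeUnit: 'ms',
  adapterConfig: { language: 'en-us', audioEncoding: 'wav' },
  signal: new AbortController().signal // in practice, wired by the SDK
};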
TTSAdapterOutput
{
audio: ArrayBuffer; // required (encoded or raw; SDK decodes)
timeUnit?: 'ms' | 's'; // default 'ms' (check your TTS engine documentation)
// optional — word timings
words?: string[];
wtimes?: number[]; // starts
wduration?: number[]; // durations (auto-derived if omitted)
// optional — viseme timings
visemes?: string[];
vtimes?: number[];
vduration?: number[];
}
Notes:
- If you only return audio, the pipeline still animates (envelope fallback).
- If the provider returns only timepoints (marks), map them to wtimes; durations will be derived unless you provide wduration (see the sketch below).
- Units default to milliseconds; you can return 's' and the SDK will normalize.
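For instance, a rough sketch of that mapping in a parseResponse override (the endpoint and the provider's field names beyond timepoints/markName/timeSeconds are assumptions):
const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;
const marksAdapter = new RestTTSAdapter({
  endpoint: 'https://your-tts.example.com/v1/synthesize', // placeholder
  async parseResponse(res, input) {
    const data = await res.json();
    const audio = Uint8Array.from(atob(data.audioContent), c => c.charCodeAt(0)).buffer;
    const marks = data.timepoints || [];       // [{ markName, timeSeconds }]
    return {
      audio,
      timeUnit: 's',                           // timeSeconds are in seconds; the SDK normalizes
      words: marks.map(t => t.markName),
      wtimes: marks.map(t => t.timeSeconds)    // starts only; wduration is derived
    };
  }
});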
Creating your own adapter (full control)
Extend the base class when your provider is not a simple REST JSON API (SDK/WebSocket/streaming).
ESM (recommended)
import AvaCapoSDK from 'avacapo-sdk';
// Pull adapter base + errors from the public SDK namespace
const { BaseTTSAdapter, TTSSynthesizeError } = AvaCapoSDK.ttsAdapters;
export class MyWsAdapter extends BaseTTSAdapter {
constructor() {
super({ language: 'en-us', audioEncoding: 'wav' }); // adapter defaults (optional)
}
async synthesize(input) {
const cfg = this._mergeConfig(input.adapterConfig);
const { text, signal } = input;
// Your transport: WebSocket/SDK/etc.
const { audio, words, wtimes } = await myWsClient.synthesize({ text, cfg, signal });
if (!audio) throw new TTSSynthesizeError('Empty audio');
// Normalize (adds timeUnit, derives missing durations, trims arrays)
return this._normalizeOutput(
{ audio, timeUnit: input.timeUnit || 'ms', words, wtimes },
input
);
}
}
UMD (<script> tag)
<script>
const { BaseTTSAdapter, TTSSynthesizeError } = AvaCapoSDK.ttsAdapters;
class MyWsAdapter extends BaseTTSAdapter {
constructor() {
super({ language: 'en-us', audioEncoding: 'wav' });
}
async synthesize(input) {
const cfg = this._mergeConfig(input.adapterConfig);
const { text, signal } = input;
// Your transport: WebSocket/SDK/etc.
const { audio, words, wtimes } = await myWsClient.synthesize({ text, cfg, signal });
if (!audio) throw new TTSSynthesizeError('Empty audio');
return this._normalizeOutput(
{ audio, timeUnit: input.timeUnit || 'ms', words, wtimes },
input
);
}
}
</script>
Use it:
const mine = new MyWsAdapter();
const avatar = await sdk.talkingAvatar(null, { ttsAdapter: mine });
await avatar.speakText('Custom transport works!');
Authentication (best practice)
Do not ship secrets in client code. Provide a short-lived token from your backend and fetch it at runtime:
ttsOpts: {
adapterConfig: {
jwtGet: async () => (await (await fetch('/api/jwt')).json()).token,
// jwtHeaderName: 'Authorization', jwtPrefix: 'Bearer ', // defaults
// apiKey: '...', apiKeyParam: 'key'
}
}
RestTTSAdapter automatically adds the Authorization header and/or ?key=... query param.
Cancellation
Adapters receive an AbortSignal as input.signal. In RestTTSAdapter this is wired automatically; in custom adapters, forward it to your fetch/SDK/WebSocket.
Runtime API:
avatar.abortActiveSpeech(); // stops current synthesis + playback
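In a custom adapter, forwarding the signal to fetch might look like this (a sketch assuming a raw audio/* response and an endpoint supplied via adapterConfig):
const { BaseTTSAdapter } = AvaCapoSDK.ttsAdapters;
class AbortAwareAdapter extends BaseTTSAdapter {
  async synthesize(input) {
    const cfg = this._mergeConfig(input.adapterConfig);
    // Forward the SDK-provided AbortSignal so abortActiveSpeech() cancels the HTTP request
    const res = await fetch(cfg.endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input: input.text, voice: input.voice }),
      signal: input.signal
    });
    return this._normalizeOutput({ audio: await res.arrayBuffer() }, input);
  }
}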
Local HeadTTS example (via Rest adapter)
const { RestTTSAdapter } = AvaCapoSDK.ttsAdapters;
const local = new RestTTSAdapter({
endpoint: 'https://your-tts.example.com/v1/synthesize'
});
const avatar = await sdk.talkingAvatar(null, {
ttsAdapter: local,
ttsOpts: {
voice: 'af_bella',
timeUnit: 'ms',
adapterConfig: {
language: 'en-us',
audioEncoding: 'wav'
}
}
});
avatar.speakText('Hi from local TTS!');
Troubleshooting
- No audio: ensure you return an ArrayBuffer (convert base64 to bytes first).
- Timings off: verify timeUnit ('ms' vs 's').
- 401/403: check jwtGet() and your backend/CORS config.
- Abort doesn’t work: make sure you pass input.signal into your request logic.
Summary
- Pick a built-in adapter from AvaCapoSDK.ttsAdapters or write your own by extending BaseTTSAdapter.
- Return { audio, ...optional timings } in TTSAdapterOutput.
- Hand it to TalkingAvatar via ttsAdapter and call speakText().
That’s it – the SDK takes care of decoding, queueing, and lip-sync animation.