Adding Audio to a Web Guide

How I added text-to-speech audio to the SSD Basics guide, from preprocessing to integration.

7 steps · 12 min read · 2026-04-03
AI Tools Recommended (see full toolkit below)
Claude App
Preparation & planning
Claude Code CLI
Building & implementation
Codex CLI
Code review
Gemini
Research

I’ve pushed an update to the new SSD Basics page that adds basic audio functionality to the guide, which should improve accessibility. Even without the right hardware, making this type of change is fairly straightforward. With a nod to my recent post about where we’re going with the sub and content moving forward, I am including a brief outline of how this was done.

Step 1

Assess the Source Material

The SSD Basics guide is long but naturally divided, so this is just a matter of knowing where break points make sense. Since sections use distinct heading fonts, this isn’t too bad. If you are writing your own content, keep this in mind as you build, as it will make an audio translation easier later. The basic takeaway here: structure your content before doing anything else.
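As a rough illustration of splitting on headings, here is a minimal sketch. It assumes the source text is available as markdown with `#`-style headings, which may not match how your page is stored; the function name `split_sections` and the default "Introduction" title are made up for this example.

```python
import re

# Assumption: headings look like "# Title" or "## Title" (markdown style).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def split_sections(text):
    """Split text into sections, starting a new one at every heading line."""
    sections = []
    title, body = "Introduction", []  # text before the first heading lands here
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section only if it actually contains text.
            if any(part.strip() for part in body):
                sections.append({"title": title, "text": "\n".join(body).strip()})
            title, body = match.group(2), []
        else:
            body.append(line)
    if any(part.strip() for part in body):
        sections.append({"title": title, "text": "\n".join(body).strip()})
    return sections

sample = "# What is an SSD\nFlash storage.\n\n## NAND\nCells store charge.\n"
for section in split_sections(sample):
    print(section["title"], "->", len(section["text"]), "characters")
```

Printing the character count per section is a quick way to spot sections that are too long or too short to make good audio breaks.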

How AI can help

If you are developing your structure before making your content, tell the AI your end goal ahead of time. If it knows you are planning to use audio, it can keep things aligned from the get-go. If you're doing it after the fact, some sections might not break evenly, so define breakpoints manually where the natural ones do not flow well.

Step 2

Preprocessing

Text-to-speech (TTS) engines don’t handle technical terms and acronyms very well. You can write out a pronunciation/phonetic key for these ahead of time. Running your material through even a generic/free TTS can help identify the problem terms. I missed a few, but you get the idea. Some words are ambiguous (for example, “SATA” can legitimately be said two different ways; no worries, not going to open the GIF discussion here). TTS will get many terms wrong in almost any field, and if you are generating on a paid site, retries can be costly.
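One way a pronunciation key could be applied before sending text to the engine is a whole-word substitution pass. This is only a sketch: the glossary entries and respellings below are invented examples, not the key used for the guide, and the right respelling depends on the engine and voice.

```python
import re

# Hypothetical glossary: terms and respellings are examples only.
GLOSSARY = {
    "SATA": "say-ta",
    "NVMe": "N V M E",
    "NAND": "nand",
}

def apply_glossary(text, glossary):
    """Replace whole-word glossary terms with their phonetic respelling."""
    if not glossary:
        return text
    # Longest terms first so a longer key wins over a shorter overlapping one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: glossary[m.group(1)], text)

print(apply_glossary("SATA and NVMe drives both use NAND.", GLOSSARY))
```

Matching is case-sensitive and limited to whole words, so ordinary words that merely contain an acronym are left alone.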

How AI can help

You can use AI to build a basic glossary for you and tell it how to say certain terms in the process. You should also do a full audio run on your content and add any terms that are still mispronounced to the glossary, either manually or by telling the AI.

Step 3

Text Extraction

This step applies only if you’ve already made the page or are deriving the text from markdown or code. Once the text is extracted, a JSON manifest can be generated that maps each section to the file path of its audio for the audio player.
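To make the manifest idea concrete, here is a small sketch of how one could be generated. The field names (`id`, `title`, `audio`), the `audio/` directory, and the numbered file naming are assumptions for illustration; the manifest used on the site may be shaped differently.

```python
import json
import re

def slugify(title):
    """Lowercase the title and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def build_manifest(titles, audio_dir="audio", ext="mp3"):
    """Map each section title to an id and a hypothetical audio file path."""
    sections = []
    for index, title in enumerate(titles, start=1):
        slug = slugify(title)
        sections.append({
            "id": slug,          # could double as the section's anchor on the page
            "title": title,
            "audio": f"{audio_dir}/{index:02d}-{slug}.{ext}",
        })
    return {"sections": sections}

manifest = build_manifest(["What Is an SSD?", "NAND Flash"])
print(json.dumps(manifest, indent=2))
```

Keeping the numeric prefix in the file name preserves section order on disk, which makes spot-checking a batch easier.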

How AI can help

AI will extract all your data and make a "map" of elements for tying in content in other forms, for example making sure the audio is mapped to the text. Sites will also have a table of contents and search functionality, both of which can be improved. The takeaway here is that you're designing a site for usability but also engineering it to be easy to update.

Step 4

Set Up the TTS Engine

If you are not aware, there are some excellent open-source options out there for this. I chose Kokoro, which offers an OpenAI-compatible API. It runs in Docker and supports GPU acceleration. On Windows, use Docker Desktop with WSL2 integration. GPU acceleration needs to be enabled, and how you do that depends on your GPU; I was using NVIDIA CUDA, but AMD (ROCm) and Intel options exist for other TTS projects. Also, be careful to pick a good voice.
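Because the server speaks an OpenAI-style API, a request for speech can be sketched with nothing but the standard library. Treat the specifics as assumptions: the port (8880), the model name (`kokoro`), and the voice (`af_heart`) depend on how your container is configured, and the request shape follows the OpenAI speech endpoint (`/v1/audio/speech`).

```python
import json
import urllib.request

# Assumptions: local server on port 8880, model "kokoro", voice "af_heart".
BASE_URL = "http://localhost:8880/v1"

def build_speech_request(text, voice="af_heart", fmt="wav", base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI speech format."""
    payload = {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": fmt,
    }
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path

# Usage, once the container is running:
# synthesize("An SSD stores data in NAND flash.", "test.wav")
```

A single short test sentence like the one in the usage comment is also a cheap way to audition voices before committing to a full run.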

How AI can help

AI can help you understand the codebase here; start by cloning the repository with git. AI can also set up and run the container for you once Docker Desktop with WSL2 is installed (if using Windows). It can also write the proper commands for audio work, saving you substantial time, and help you use GPU acceleration or pick the right project for your hardware.

Tip: If you want to use your own voice, it’s very possible and easy to clone your own for free (or honestly, any voice, even with very little source material). Chatterbox is a good, free option. ElevenLabs is a good free/paid option, as well. You can also do facial expressions and more for video but that will be covered in a separate guide.

Step 5

Chunk the Text and Generate Audio

It’s best to set boundaries for the generation process so that each chunk is handled individually. Brief gaps are also inserted between chunks for natural pacing. This process is very fast with a good GPU, guys; it took me ~3 minutes for 135K characters, which comes out to almost 3 hours of audio! Quality and output are up to you. Understanding chunking, even at this macro level, is also useful for learning AI and AI data management and will help with other tasks or projects.
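Chunking at sentence boundaries can be sketched in a few lines. This is a simplified illustration, not the exact logic used for the guide: the 400-character limit is an arbitrary assumption, and the sentence split is a rough regex that will trip on abbreviations like "e.g.".

```python
import re

def chunk_text(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Rough sentence boundary: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

for chunk in chunk_text("One. Two. Three.", max_chars=9):
    print(repr(chunk))
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, since a mid-sentence cut tends to sound worse than a slightly oversized chunk.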

How AI can help

AI can help with the chunking process. It can also detect the natural pauses in your voice. This is also useful for generating subtitles for other content, but I'll cover that in other guides. Also, with the proper setup you can churn out this audio very rapidly, and while you can script this manually, AI allows you to pivot faster and more flexibly. This can be useful when working with multiple systems especially.
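If you do script the pacing gaps by hand, the standard library is enough for WAV output. The sketch below assumes every chunk is a PCM WAV file with the same format and a sample width of 16 bits or more (where silence is all-zero bytes); the 0.4 second gap is an arbitrary starting point, not a recommendation from the guide.

```python
import wave

def join_with_gaps(chunk_paths, out_path, gap_seconds=0.4):
    """Concatenate WAV files, inserting silence between consecutive chunks."""
    if not chunk_paths:
        raise ValueError("no chunks to join")
    params = None
    with wave.open(out_path, "wb") as out:
        for index, path in enumerate(chunk_paths):
            with wave.open(path, "rb") as chunk:
                if params is None:
                    # Copy channel count, sample width and rate from the first chunk.
                    params = chunk.getparams()
                    out.setnchannels(params.nchannels)
                    out.setsampwidth(params.sampwidth)
                    out.setframerate(params.framerate)
                frames = chunk.readframes(chunk.getnframes())
            if index > 0:
                # Silence: zero bytes for gap_seconds worth of frames.
                gap_frames = int(params.framerate * gap_seconds)
                out.writeframes(b"\x00" * gap_frames * params.sampwidth * params.nchannels)
            out.writeframes(frames)
    return out_path

# Usage with hypothetical file names:
# join_with_gaps(["chunk-001.wav", "chunk-002.wav"], "section-01.wav")
```

Joining per section rather than per guide keeps regeneration cheap: a bad chunk only forces one section to be rebuilt.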

Important: You can fall back to CPU inference, but it’s an order of magnitude slower.
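To put the numbers from this step in perspective, here is the back-of-the-envelope arithmetic. It rounds "almost 3 hours" up to 3 hours and reads "an order of magnitude" as exactly 10x, so treat the results as ballpark figures only.

```python
chars = 135_000
gpu_seconds = 3 * 60          # ~3 minutes of generation on GPU
audio_seconds = 3 * 60 * 60   # "almost 3 hours" of audio, rounded up

chars_per_second = chars / gpu_seconds
realtime_factor = audio_seconds / gpu_seconds
cpu_minutes_estimate = gpu_seconds * 10 / 60  # assuming CPU is 10x slower

print(f"{chars_per_second:.0f} characters per second")
print(f"about {realtime_factor:.0f}x faster than real time")
print(f"about {cpu_minutes_estimate:.0f} minutes for the same job on CPU")
```

Even at the CPU estimate, a full run finishes well within an hour; the GPU mainly matters when you are iterating and regenerating often.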

Step 6

Integrate the Audio Player

I went basic with vanilla JS and simple play buttons. The playbar, however, can jump ahead or be dragged, and has a speed toggle and a time display. This works on mobile, too. The section being played stays highlighted in the sidebar even when scrolling moves the regular section highlight. The manifest is needed here, but a fallback exists. More advanced media functions can follow from basic design concepts.

How AI can help

AI can help you customize how the audio bar works: where it's positioned, styling, speed modes, features like drag, and more. AI is also useful for testing sites on mobile, which is very valuable. For full functionality, the AI will need the right plugins and skills for web design; Puppeteer and Playwright are two examples.

Step 7

Other Tips

It’s worth doing this slowly to make sure things work before doing a full batch. However, if you chunk well, regeneration is not too bad: a matter of seconds on GPU. Output format is up to you as well, but I went with MP3 here; the default is WAV, so conversion is done with ffmpeg. Understanding media format and quality is important for automation and for getting the output you want.
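The WAV to MP3 conversion can be scripted around ffmpeg. This sketch assumes `ffmpeg` is on your PATH and was built with the `libmp3lame` encoder; the quality value of 4 is just an example setting, not the one used for the guide.

```python
import subprocess
from pathlib import Path

def mp3_command(wav_path, quality=4):
    """Build the ffmpeg argument list for a WAV to MP3 conversion."""
    wav_path = Path(wav_path)
    return [
        "ffmpeg",
        "-y",                      # overwrite an existing output file
        "-i", str(wav_path),
        "-codec:a", "libmp3lame",  # MP3 encoder, if this ffmpeg build includes it
        "-q:a", str(quality),      # VBR quality: lower number means higher quality
        str(wav_path.with_suffix(".mp3")),
    ]

def convert_to_mp3(wav_path, quality=4):
    """Run ffmpeg and raise if the conversion fails."""
    subprocess.run(mp3_command(wav_path, quality), check=True)

print(" ".join(mp3_command("audio/01-intro.wav")))
```

Keeping the WAV files around until the MP3s have been checked means a different quality setting only costs a re-encode, not a regeneration.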

How AI can help

AI can run ahead of itself. Make sure to instruct it carefully and have it go step by step. You would normally test small things and build out from there, so don't accept massive project changes without verification. Being the human in the loop is critical. It also pays to do research before you do anything, and don't rely on one AI to do this, either. In fact, you shouldn't rely on a single model to do anything. Consensus is valuable. Also, be willing to have conversations with your AI. Ask it why it did something or why you would do something a certain way. Ask it ahead of time about file formats, why it would choose one, how to get something better, and whether your desired format is really better. Question everything, and have it question and push back on you on everything.

Toolkit Reference

Below are the plugins, extensions, and MCP servers used across the steps in this guide.

Community Plugins

Everything Claude Code
Agent workflow optimization, skill system, and development patterns
Codex Integration
Code review via OpenAI Codex CLI

Official Plugins

These are available within Claude Code CLI and can be installed directly from the plugin menu.

superpowers
Planning, brainstorming, TDD, and implementation workflows
context7
Documentation lookup for Docker, Kokoro, ffmpeg, and other libraries
firecrawl
Web research for finding TTS projects and documentation
frontend-design
Audio player UI design and integration
playwright
Browser testing, including mobile audio player verification
serena
Semantic code analysis for understanding TTS engine codebases
code-review
Automated code review for player implementation
code-simplifier
Code cleanup and refactoring
feature-dev
Guided feature development workflow

MCP Servers

brave-search
Web search for research and finding tools
puppeteer
Browser automation and site testing

Integrations

GitHub
Source code, open-source projects, and cloning TTS engine repos