AI Narrated Stories

When I think of storytelling, I think of campfires. People collecting around a campfire to share a ghost story or spin a tale of what haunts the woods just outside the light of the fire. I wanted to explore how Artificial Intelligence (AI) could give voice to stories so that they could be enjoyed around a digital campfire.

The Problem

Audio is a major format for many people nowadays. Not just podcasts: voice-based interaction is growing dramatically in use. As I think about technologies that are going to dominate entertainment and storytelling in the (very near) future, like Virtual Reality, I need a way to communicate stories that will feel natural and engaging.

Beyond the engagement, I also want to make sure that I can represent different voices in my story to create true immersion for the person experiencing it. When I think about podcasts like The White Vault, part of what makes that podcast so great is the collection of voices that brings it to life. I want that. I want female voices to sound female, not like a male narrator trying to imitate a female voice. Feeling like you are IN the story is going to build the storytelling experience of the future.

So, I need lots of voices that will read whatever I want them to say and could possibly need to change the story based on inputs from the person experiencing the story. I could hire an army of voice actors that will be ready on demand or…

The Project: BREADCRUMBS

Start small and iterate. That’s what my innovation background has taught me. To achieve the goals of this storytelling capability, I needed to start small. Fortunately, I had a short story that was perfect for a small project. BREADCRUMBS (from BLOTS) is the reimagined story of Hansel and Gretel, where the children aren’t as innocent as you’d think and the “witch” isn’t as villainous as you were originally led to believe. The story was short, light on dialog, and heavy on atmosphere.

I’ve experimented with using Amazon’s Alexa to tell stories before and knew that what I wanted to achieve, natural-ish language generated by a computer, was possible.

I can’t remember how I found Replica Studios, but they were perfect for what I wanted: many different voices (far fewer back in mid-2021 than they have now) and fairly natural-sounding speech. The lack of dialog in the story was great because I could focus on a single narrator. There is a second speaker in BREADCRUMBS, but that speaker only has a few lines. Using Replica Studios’ tool, I was able to take the content from the story and have it read by an AI voice. It took some experimenting to find the right pace, tone, and moments of emphasis in the recordings.

After recording the narration in Replica Studios, I used Adobe Audition to piece together the audio. At first, I tried to use a video recording tool to assemble the audio tracks, but that didn’t work as well as I hoped. Why video first? I knew how to use Adobe Premiere Rush and didn’t know how to use Audition at all. In the end, I focused on Audition because it was the right tool for the job, and learning it was a lot of fun.

Once I stitched the narration together, something was missing. The story felt bland. Listening to some podcast audio dramas, I quickly identified why: I needed sound effects. Using Storyblocks, I was able to find some quick and easy sound effects and music to add to the story.

But this wasn’t going to be delivered through Audible or a podcast, so I needed a visual element. I used Unity3D to layer the audio into a 3D world. This not only provided a nice visual to go with the audio, but also let me start testing how to integrate audio into 3D worlds (remember my VR statement earlier).

The result…

The Learnings

Oh…where to start. This project was much more complex than I originally planned, but I’ve done a few more since then and can now build a narration in a weekend. The learning curve is a bit steep, but once you’ve got it, producing content can be very fast.

Say what you mean, not what’s right

To start, the AI voices didn’t always pronounce words the way I expected. For instance, my name, Kulp (pronounced: K-all-p), was originally pronounced “Kup” by the AI system. By working with the spelling, I was able to get the proper pronunciation, but I had to spell my name “Kalp.”

There were other words, like “presents” and “oven,” that were pronounced in unique ways. Ultimately, I needed to write to the voice. This means that as I changed voices (which I did numerous times; in Replica Studios this is called “recasting”), I had to update the content. Be ready to experiment with the voices.
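One way to keep those voice-specific spellings manageable is a simple substitution table applied to the script before it goes to the TTS tool. This is just a sketch of the idea, not anything from Replica Studios; the `RESPELLINGS` entries and the `prepare_for_tts` helper are my own hypothetical names, and real respellings would have to be rediscovered per voice.

```python
# Hypothetical phonetic respellings discovered through trial and error.
# Each voice may need its own table, since pronunciation varies by voice.
RESPELLINGS = {
    "Kulp": "Kalp",  # renders closer to "K-all-p" instead of "Kup"
}


def prepare_for_tts(text: str, respellings: dict) -> str:
    """Swap normal spellings for the phonetic spellings a voice needs."""
    for word, phonetic in respellings.items():
        text = text.replace(word, phonetic)
    return text


print(prepare_for_tts("Tim Kulp wrote this story.", RESPELLINGS))
# → Tim Kalp wrote this story.
```

Keeping the canonical text separate from the respelled text means the published story stays spelled correctly while the narration copy carries the pronunciation hacks.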

This also means, given Replica Studios’ billing model, that you can pay multiple times for the same thing to be read. Their billing structure is very reasonable and based on buying the time it takes the computer to produce the words you want it to say (I’m massively oversimplifying here, but this gives you an idea of how it works). Each time you ask the AI voice to say something new (called rendering), you pay for the seconds or minutes it takes the AI system to create the audio.

This leads to the next point…

Render in small chunks

Don’t write 20 lines of text and then hit “play” to see what comes out. If you have a lot of text to render and you make a small change, all of the text renders again, NOT just the part you changed. So, render in small chunks.

As an example, let’s say you have the following sentence:

The dog jumps far.

You render it and don’t like the way the word “far” sounds. Perhaps you want the word to sound like “fer” because that’s how your character talks. In this instance, changing “far” to “fer” would cause the whole line to render again. This gets crazy when everything but one word is right and you have to spend your minutes re-rendering a longer sentence. Save yourself the crazy: work in small chunks when you can.

Voice is key to Character

Don’t just pick any voice. Your voice choice tells a lot about your character. Be thoughtful about which voice you pick and how that voice speaks as your character. Imagine the story above told by a youthful voice or an old man’s voice. The characters are totally different.

With this story, I’m often asked why I picked a woman’s voice for the main character. To me it is obvious, but those who read the story before hearing it often think the character is male until the scene in the house with the cookies (that’s when they make the Hansel & Gretel connection).

Use voice, tone, and pace to tell your story.

Hear your story to hear your story

Not only is AI narration good for creating audio content, it is also great for hearing your story read back to you in a more natural voice. The first iteration of this process led to a significant rewrite of BREADCRUMBS. Hearing the story for the first time in someone else’s voice highlighted a lot of areas for improvement.

I found a lot of value in listening to my stories through Replica Studios, but that takes a lot of rendering minutes. Now I use Microsoft Word’s Read Aloud feature to listen to the story. It is built into Word and easy to use. Tip: I found that increasing the speed by one notch makes for a better listening experience. Your results may vary.

So much more…

There were tons of other learnings here too, everything from how the audio came together to leveling out the sound effects and building the 3D environment. All of that will be saved for future posts.

What’s Next

I’ve already started on what’s next. Further experiments in AI narration are underway, but instead of whole stories like this one, I’m using AI narration as companion content for stories told through Facebook Messenger. I talked about using Messenger in another portfolio post; you can learn more there.

Next, I’m planning to explore dynamic language generation through the Replica Studios API. This next experiment will use VR characters to create a conversation in a virtual space. Stay tuned as I explore more AI + VR use cases in storytelling.

Until then, enjoy another AI narrated story: The Midnight Pets.

Thank you to Brett Sayles from Pexels for the cover art of this post.

tim kulp