[Mechanical voice] Today I have something different for you. Let's create an English speech synthesizer – with a Finnish accent! [Human voice] Earlier, I made three videos – providing background information for this video. If you haven't watched them yet, I suggest you do so now. I have compiled them into a nice playlist, so you can watch them all in one sitting. Please do so now, and then come back here. Click the card that opens the playlist. I'll be waiting here.

Unlike in the PCM video, replaying pre-recorded speech is not what I had in mind – for this video. To recap, here is the list of phonemes – that will be the building blocks of speech for this synthesizer. The bottom row lists phonemes – that most Finns do not pronounce correctly, but are instead aliased into other phonemes. I think for authenticity, we can do the same in our speech synthesizer. This will help keep the design simple. This is the resulting roster of phonemes. There are 22 phonemes in total.

Now there are many ways to go forward from here. One of the popular approaches, which leads to high-quality speech synthesizers, is to create a list of all pairs of phonemes – that can occur in normal speech. For example, all consonants followed by all vowels; but also all vowels followed by all consonants, and of course, all pairs of vowels – that can be reasonably pronounced, and all pairs of consonants too. They would hire a voice artist – to record all these hundreds of samples, at constant pitch and constant stress. For some speech synthesizers, even triplets might be recorded. Most likely, they would construct an artificial long piece of text that contains all these phoneme pairs, and the voice artist would be instructed – to read it as monotonously as they possibly can. Then someone would use an audio editing program, and meticulously cut pieces from the recording – to populate this table. The speech synthesizer would mix and select these samples at run time. For example, this word, “kivijalkakauppa”, would be constructed from 15 voice samples, some of which are identical, and the synthesizer would seamlessly blend – the end of one sample into the beginning of the next.
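Purely as an illustration of that blending idea (my demo synthesizer below does nothing of the sort), here is a minimal C++ sketch of what such blending might look like: a short crossfade between the tail of one sample and the head of the next. All the names are made up for the example, and it assumes mono 16-bit PCM buffers; real concatenative synthesizers are considerably smarter about the seams.

    #include <algorithm>
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Join two mono 16-bit samples, fading the tail of `a` out while the
    // head of `b` fades in. `overlap` must not exceed either sample's length.
    std::vector<std::int16_t> Blend(const std::vector<std::int16_t>& a,
                                    const std::vector<std::int16_t>& b,
                                    std::size_t overlap)
    {
        std::vector<std::int16_t> result(a.size() + b.size() - overlap);
        std::copy(a.begin(), a.end() - overlap, result.begin());
        std::copy(b.begin() + overlap, b.end(), result.begin() + a.size());
        for(std::size_t n = 0; n < overlap; ++n)
        {
            float w = float(n) / overlap; // 0 at the start of the seam, 1 at its end
            float mix = a[a.size() - overlap + n] * (1 - w) + b[n] * w;
            result[a.size() - overlap + n] = std::int16_t(mix);
        }
        return result;
    }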
For my demo speech synthesizer, I will not do anything that complex. I am going to operate on single phonemes only. Now, I could just use recordings of myself – speaking all these different phonemes, and it would not take much time to do that at all. Instead, I decided to approach the problem – in an old-fashioned way. So, I made this chart. It shows how each of these phonemes might be constructed. The first 14 phonemes have something in common: the vocal cords are vibrating throughout the phoneme. For example, I can say: “ilmiömäinen mongolialainen viininviljelyalue” in a single unbroken voice, using all of those 14 phonemes.

Let's speak about the vowels first. Humans can speak because we are able to change – how our voice resonates within our mouth. Bear with me for a moment: I am going to create an incredibly stupid-sounding recording. [Chants vowel sounds] Now let's clean up that audio and save it on disk. Next, let's open the audio in Praat. Praat is an open-source program for studying phonetics. While I was making that recording, you heard several different vowels. My voice stayed at a constant pitch, but different harmonic overtones were created – by varying the shape of the airways within my mouth. In this analysis window, we can actually see what happened. At the bottom there is a blue line indicating my voice pitch. It is relatively horizontal, which means there was not much variation in it. However, these red lines represent the harmonic overtones of my voice, and they are all over the place, changing smoothly between low and high values.

In speech, these harmonic overtones are called formants. In Wikipedia – there is a relatively brief article about formants, including this table of typical values for the formants of different vowels. Each of the 14 sounds has a different set of formants – that make up the sound. Formants are produced by different parts of the vocal tract, including the larynx and the pharynx. For a speech synthesizer, the exact mechanism is not as important as the result. Additionally, with the /ʋ/ and /j/ sounds, there is some level of frication present. It is a little whooshing component. The whooshing sounds a bit different in each consonant. It may be higher pitched or lower pitched, and it may be short or long.

In other words, there is a sound source, and a tube that adds resonances and noises to the sound. This is called a source-filter model. An audio compression method called Linear Predictive Coding – is centered around this scheme. “LPC starts with the assumption – that a speech signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation – of the reality of speech production.” Do you have a cell phone? LPC happens to be the basis of GSM voice compression. If you have a cell phone, it contains an implementation of LPC. So, I am going to use LPC for this synthesizer as well.

In this table – I have identified the component sounds that I need to synthesize. For the first 14 phonemes, we have voice that is modulated in different ways, plus some optional frication at the same time. The rest of the consonants are similar, except there is no voice simultaneously. I have split each phoneme into three parts: a beginning, a middle, and an end. Each phoneme may have a short sound of some kind – at the beginning and at the end. For example, at the end of /m/, there is a subtle sound from the lips. The middle is the part of the phoneme – that is stretched as long as it needs to be – to produce short or long sounds. So I need a total of 17 sustain sounds, 7 release sounds, 1 glottal noise, and silence: 26 sound samples in total.

To generate the samples, I recorded myself saying this sequence as monotonously as I could: [Enunciates phonemes] This recording was imported into Praat. Then I edited the sound to make it completely monotonous. In hindsight, this step was completely redundant, but it was nice to learn that this research tool – could double as an autotune program for bad singers. This is how the result sounds. This was then downsampled to 44 kHz, removing some mostly irrelevant detail. Then, I used Praat to generate a 48th-order LPC from this recording. The resulting file looks like this. It's a text file that contains… numbers. The audio was divided into frames, and for each frame, a set of coefficients and a gain is listed.

Next, I wrote a C++ program to play this file. The program reads all the lines in the file and identifies their content. It saves important parameters, like the samplingPeriod, which is the inverse of the sampling rate, into variables. The coefficients are saved in an array. When it encounters the gain line, it synthesizes the frame.
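In outline, that reading loop might look like the sketch below. This is only a sketch: it assumes the line-oriented “key = value” layout of Praat's text files, skips the header fields that the real program also reads, and the file name is invented. SynthesizeFrame is a placeholder for the synthesis code discussed next.

    #include <cstdlib>
    #include <fstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream file("speech.LPC"); // file name assumed for the example
        double samplingPeriod = 0;        // seconds per sample, 1 / samplerate
        std::vector<double> coefficients; // filter coefficients of one frame
        for(std::string line; std::getline(file, line); )
        {
            auto eq = line.find('=');
            if(eq == std::string::npos) continue;
            const char* value = line.c_str() + eq + 1;
            if(line.find("samplingPeriod") != std::string::npos)
                samplingPeriod = std::strtod(value, nullptr);
            else if(line.find("a [") != std::string::npos)
                coefficients.push_back(std::strtod(value, nullptr));
            else if(line.find("gain") != std::string::npos)
            {
                double gain = std::strtod(value, nullptr);
                // A frame is now complete: synthesize it.
                // SynthesizeFrame(coefficients, gain, samplingPeriod, ...);
                (void)gain;
                coefficients.clear();
            }
        }
    }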
The frame is synthesized next. It starts by generating an arbitrary buzz. Anything goes, as long as it has a clear frequency, and as long as it's not a pure sine wave. Next, the LPC filter is applied. The filter shapes the frequency characteristics – of the buzz that is fed to it. (Strictly speaking, the synthesis filter is an all-pole IIR filter, the inverse of the FIR filter used during analysis: it feeds its own past outputs back in.) Basically, it's a vocoder. The resulting samples are saved into a buffer. Once the whole file has been processed, the buffer is saved into a WAV file. And this is how it sounds.

48 was my choice for the order of the LPC data. I made a comparison of different LPC orders. Here's a short voice sample – that I took from one of Dr. David Wood's videos. [Dr. Wood's voice] “How to stop prison radicalization.” And here's how it sounds at different orders. “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” “How to stop prison radicalization.” I think that 48 was the sweet spot – where artifacts were minimal, and increasing the order beyond 48 – did not improve the audio enough – to justify the increase in data.

Now it is important to note that the LPC file is not a recording. It is a synthesis instruction. For example, I can modify the “buzz” formula – and replace it with white noise. [Whispered] “How to stop prison radicalization.” This changes the voice into a whisper. Or I can change the tempo. Make it four times slower. Or make it twice as fast! Or change the pitch. Make it higher. Or make it lower. My buzz formula deliberately contains – a small amount of aspiration in it. If I remove the aspiration and leave just the buzz, the sound becomes a bit cleaner – but also more synthetic sounding.

These samples are recorded at 44 kilohertz. If I used a much lower sample rate, such as 8 kilohertz, a much smaller number of coefficients would be enough. Here is a 16-coefficient LPC made from the 44-kilohertz recording: “How to stop prison radicalization.” And here is a 16-coefficient LPC made from an 8-kilohertz recording: “How to stop prison radicalization.” The latter was a bit more muffled, like a telephone line, but had far fewer chirping artifacts in it. Lowering the sample rate gives more bang for the buck – in terms of data transmission, and that's why cell phones use a low sample rate. But there are plenty of low-sample-rate speech synthesizers out there, and I want to use a good sample rate, so I'm going with 44 kilohertz and a 48th-order LPC.

So, the LPC file is divided into frames, each frame representing the characteristics of the audio – for a small slice of time. Next, I spent a day writing this tool – which is a modification of the WAV-writing program from earlier. This program allows you to adjust the parameters, such as breathiness and buzziness, in real time, and to choose any frame from the recording to play. I used this to pick the frames that, in my opinion, best represented the phonemes – that I wanted to include in my speech synthesizer. Next, I wrote a tool that copy-pastes the frames that I picked, and it produced this file. It is C++ source code.

Which brings me to the next part: C++ source code. We begin with the data structure that was just generated. It stores each recording as a structure. I decided that each recording can have multiple frames rather than just one, for better quality.
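The structure might look something along these lines. This is a sketch rather than the actual generated code, and the names are mine:

    #include <vector>

    // One LPC frame: a gain plus the 48 filter coefficients.
    struct Frame
    {
        double gain;
        std::vector<double> coefficients;
    };

    // One recording: a phoneme symbol and one or more frames for it.
    struct Recording
    {
        char32_t symbol;           // e.g. U'a', U'm', U'r'
        std::vector<Frame> frames; // multiple frames, for better quality
    };

    // The generated table: 17 sustains, 7 releases, glottal noise, and silence.
    extern const std::vector<Recording> recordings;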
The process of text-to-speech begins by reading the text input – and converting it into a list of phonemes, or rather, prosody elements. First we start by normalizing the text, removing as much unnecessary detail as possible – such as converting all of it into lowercase. I also went ahead and converted it into 32-bit Unicode, because dealing with text character by character – is quite difficult in UTF-8, where a single character can span multiple bytes. I mean, it's still not perfect even in 32-bit Unicode – because of combining diacritics and such, but you get what I mean. It helps with this application. Punctuation must also be taken care of. I decided to add special symbols, the angle brackets, that will later be used to control the pitch of the voice. I'll just leave the pitch handling blank for now – and get back to it later.

Now that the text has been canonicalized – and the work-in-progress string should only contain pronounceable letters and pause markers, let's convert it into indexes into the list of sound recordings. This code is a bit complicated for what it actually does; it basically just assigns a timing value to each phoneme, depending on whether it is repeated or not. If you are interested in exploring it in detail, you can download the source code through the links in the video description, and explore it offline. [Music]

Now that we have the list of records – that we should use to play the speech, let's go through them. Earlier I mentioned that in my design, each record may actually contain more than one frame. I decided upon three different styles – for the playback of these frames. The synthesizer might choose one of the frames to play at random, for some variation in the voice. Or it might play all of them in a sequence, for use whenever a single frame is not enough – to capture the phoneme clearly. Or, in the case of the trilled R, it might rapidly cycle through the frames. I will sketch these three styles in code shortly.

Whatever the method, we do need the actual synthesizer, so let's tackle that part now. This is basically the same code as in the LPC-to-WAV converter – that I briefly showed earlier, but let's go through it in more detail now. I am using SFML for this project. This AudioDriver class is basically identical – to the one in the PCM video that I made earlier. Its job is to read samples from an array – and push them to the sound library. There is nothing too exciting about it.

The interesting part is – where the LPC frames get converted into wave audio. In the context of speech synthesis, LPC works so that first there is a sound source: a buzzer. Something that generates a tone that has a pitch. Anything will do, including music, as long as it's not a pure sine wave. It cannot be a pure sine wave, because the next step is applying the LPC filter over it. This filter either attenuates or amplifies – certain frequencies of the buzz, but it cannot conjure them up from nothing. The output of the filter, which is the buzz minus a weighted sum of the filter's own past outputs, is saved into a rolling buffer. That's what the modulo operator is for: it makes sure that the indexes wrap around – so that the same small buffer is reused over and over again. The latest sample is sent to the speaker. In my design, the audio chunk is first rendered into a temporary buffer, and then moved into the buffer shared with the audio engine. This is so that we can minimize the time – that the audio buffer has to be locked.
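Condensed into one function, the loop just described might look like the sketch below. The names and parameters are mine for illustration, and the sign and gain conventions of LPC coefficients vary between tools, so treat the exact formula as the general idea rather than gospel. Note the comment about the whisper trick from earlier: returning plain noise from the buzz function is all it takes.

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // An arbitrary buzz: a sawtooth with a whiff of noise ("aspiration").
    // Anything with a clear pitch will do, as long as it is not a pure sine.
    // Returning only the noise term would turn the voice into a whisper.
    double Buzz(double& phase, double pitchHz, double sampleRate)
    {
        phase += pitchHz / sampleRate;
        phase -= (int)phase;
        double noise = std::rand() / double(RAND_MAX) - 0.5;
        return (2 * phase - 1) + 0.1 * noise; // aspiration amount chosen by ear
    }

    // Run one frame's worth of excitation through the all-pole LPC filter.
    // `history` is the rolling buffer of past outputs; the modulo operator
    // keeps the index wrapping around inside it.
    void SynthesizeFrame(const std::vector<double>& a, double gain,
                         double pitchHz, double sampleRate, unsigned frameLength,
                         std::vector<double>& history, unsigned& pos,
                         double& phase, std::vector<double>& out)
    {
        for(unsigned n = 0; n < frameLength; ++n)
        {
            double sample = Buzz(phase, pitchHz, sampleRate) * std::sqrt(gain);
            // Subtract the weighted past outputs (the all-pole feedback).
            for(unsigned k = 0; k < a.size(); ++k)
                sample -= a[k] * history[(pos + history.size() - 1 - k) % history.size()];
            history[pos] = sample;
            pos = (pos + 1) % history.size();
            out.push_back(sample); // in the real program, this goes to the speaker
        }
    }

Here `history` starts out as a zero-filled vector at least as long as the coefficient list, and `pos` and `phase` start at zero.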
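And here is the promised sketch of the three frame-playback styles. Again, the names are invented for the example:

    #include <cstdlib>

    enum class Style { Random, Sequence, Trill };

    // Choose which of a recording's nFrames frames to play at time step t.
    unsigned PickFrame(Style style, unsigned nFrames, unsigned t)
    {
        switch(style)
        {
            case Style::Random:   return std::rand() % nFrames;         // variation in the voice
            case Style::Sequence: return t < nFrames ? t : nFrames - 1; // play all once, hold the last
            case Style::Trill:    return t % nFrames;                   // rapid cycling, for the trilled R
        }
        return 0;
    }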
And this is what it sounds like. Mind you, this is going to be Finnish-language text right now. [Synthetic voice] “Now the word of the Lord came unto Jonah – the son of Amittai, saying: ‘Arise, go to Nineveh, the great city, and proclaim against it’”— It was already fairly understandable to an average Finnish listener, even if some phonemes were not as clear as they could be. There were three little problems with that short sample.

First, the speech was quite monotonous. We could make it sound more interesting – by smoothly altering the pitch and voice quality over time. However, that is not enough by itself. I decided to actually model the typical flow of pitch – in Finnish text reading. To do that, the text is first divided into syllables – using a rough algorithm – that simply checks where the vowels and consonants are, and decides that a new syllable begins – wherever a single consonant is followed by a vowel. Then, a pitch curve is given to the sentence – by keeping track of where each sentence begins and where it ends, giving a certain pitch to the first and the last syllable – and interpolating the rest. I will show a small sketch of this rule in code a little later. And this is what it sounds like. [Synthetic voice] “And it came to pass in the days when the judges judged, that there was a famine in the land. And a certain man of Bethlehem in Judah – went to sojourn in the field of Moab; he, and his wife, and his two sons.”

The second problem is quite obvious, and quite annoying. To be honest, I cannot identify the cause of the constant clicks and pops – heard in the audio, but I figured it was best to do something about them. My workaround for the clicks and pops is not very pretty; it is pretty much equivalent to fixing a broken television – by beating it until it works, but hey, it gets the job done. I also decided to smooth out the frame boundaries a bit, by making all the synthesis parameters change smoothly and gradually. And this is what it sounds like. [Synthetic voice] “Blessed is the man – that walks not in the counsel of the ungodly, nor stands in the way of sinners, nor sits in the seat of the scornful; but his delight is in the Torah of the Lord—”

And that's Finnish. But the title of this video was not – “let's make a Finnish speech synthesizer!” This video was about making a speech synthesizer – with a Finnish accent. So there is still work to do! I have to make it read English. To do that, I borrowed code from a very old speech synthesis program – called Rsynth, which in turn borrows from a research paper – written at the United States Naval Research Laboratory – in the year 1976. I simplified the code a bit – so that the two source code files, about 900 lines of code, fit nicely in one screenful, and I got myself a function that converts English text – into a sort of ASCII representation of the International Phonetic Alphabet. Then, I wrote a conversion table – that reduces those phonemes into the set of phonemes used in Finnish. This function is then called in the part of my program – that deals with text-to-phoneme conversion.
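A few illustrative rows of such a reduction table are sketched below. The ASCII-IPA spellings on the left and the exact substitutions are my guesses for the example; the real table in the source code is longer and may differ.

    #include <map>
    #include <string>

    // English phoneme (in an ASCII-IPA notation) -> nearest Finnish phoneme.
    static const std::map<std::string, std::string> reduce =
    {
        { "w", "v" }, // /w/ as in "we" becomes the Finnish /ʋ/
        { "D", "d" }, // /ð/ as in "this" becomes a plain /d/
        { "T", "t" }, // /θ/ as in "thing" becomes a plain /t/
        { "z", "s" }, // /z/ loses its voicing
        { "S", "s" }, // /ʃ/ as in "she" flattens into /s/
        { "b", "p" }, // voiced stops harden: /b/ -> /p/
        { "g", "k" }, // ...and /g/ -> /k/
    };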
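And here is the promised sketch of the pitch model: the rough syllable rule and the linear interpolation across the sentence. The function and parameter names are made up, the vowel list is the Finnish one, and the real code operates on the phoneme list rather than on raw text.

    #include <string>
    #include <vector>

    static bool IsVowel(char32_t c)
    {
        return std::u32string(U"aeiouyäö").find(c) != std::u32string::npos;
    }

    // Assign one pitch per syllable: fixed values for the first and the last
    // syllable of the sentence, linear interpolation for the ones in between.
    std::vector<double> SentencePitch(const std::u32string& text,
                                      double firstHz, double lastHz)
    {
        // A new syllable begins wherever a single consonant is followed by a vowel.
        std::size_t syllables = text.empty() ? 0 : 1;
        for(std::size_t i = 1; i + 1 < text.size(); ++i)
            if(IsVowel(text[i-1]) && !IsVowel(text[i]) && IsVowel(text[i+1]))
                ++syllables;

        std::vector<double> pitch(syllables);
        for(std::size_t s = 0; s < syllables; ++s)
        {
            double t = syllables > 1 ? double(s) / (syllables - 1) : 0.0;
            pitch[s] = firstHz + (lastHz - firstHz) * t;
        }
        return pitch;
    }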
[Synthetic voice] “And the result sounds like this. To be fair, it is very hard to understand. This is maybe not exactly like the typical Finnish accent, but it is pretty close, in a tongue-in-cheek manner. The text-to-phoneme ruleset is not exactly watertight.” Many people have been joking about my accent, suggesting that maybe I wrote a speech synthesizer – to do the voiceovers for my videos. Well, in case you ever wondered what would happen if I were to do that, now you know!

If you liked what you saw, give the video a thumbs-up – and hit the subscribe button if you haven't already. Hit the bell icon too, to make sure you get all notifications – of my new uploads. Thanks go to my supporters on Patreon, PayPal, Liberapay, and other sites. I have not addressed you in a video for a long time, but you are very much appreciated indeed. As always, have a nice day, and a shalom in your life.