Whose Voice Is It Anyway?

I’m such a geek that I just love watching The Big Bang Theory on TV.  The guys are often arguing about some nerdy topic like Star Trek, the space-time continuum, or string theory.  Some of my favorite episodes have been the ones where Stephen Hawking guest-starred on the show.  There’s just nothing like watching the fictional genius, Sheldon, go up against the real-life genius, Stephen, in a game of Words with Friends on Facebook.  It is one of the most popular shows on American television, and people who weren’t familiar with Stephen Hawking have now been exposed to him. If you have ever heard Dr. Hawking, you are familiar with the robotic voice that he uses to communicate.  And if you’re a developer, how much do you worry about that voice? Should you even care at all?  The answer is yes!

Notice that I didn’t call it “Stephen Hawking’s voice.”  That’s because it’s not really Dr. Hawking’s voice; it’s the voice of the synthesizer used by his assistive technology.  The same is true for screen readers.  Often, I will hear people say something like “the JAWS voice,” “the Window-Eyes voice,” “the VoiceOver voice,” or “the NVDA voice.”  In reality, though, the voice doesn’t belong to any of those screen readers; it comes from the synthesizer the screen reader is using.  Thinking otherwise is a misconception that has long been perpetuated by people both inside and outside the assistive technology world.  I’ve even heard assistive technology users talk about it this way.

So, who or what is behind the voice? Let’s take a trek through the history of synthesized speech and find out.  <cue music> A long time ago in a land far, far away, computers could not produce more than one sound at a time.  Of course, that’s assuming they could produce a sound at all.  If you were lucky, your machine had a sound card.  If you were extremely lucky (and probably extremely wealthy), you had a sound card that could play two things at the same time, called a multi-channel sound card.  Most systems could only beep or make one sound at a time, making it impossible to produce speech and play another sound simultaneously.  In this day and age, when our devices can do almost anything, we take this kind of capability for granted.  But in yesteryear, those limitations were the norm.

Synthesizer History Class

Don’t worry – there won’t be a quiz on dates. <smile>  Way back in the day, just after dinosaurs became extinct, speech synthesizers were created to produce speech. Some of the first synthesizers were completely separate from a computer, such as the Voice Operation Demonstrator (VODER) from 1939.  In the latter part of the twentieth century, synthesizers were separate hardware devices that a user would connect to a computer, usually via a serial port.  A screen reader installed on the system took the information from the screen and sent it to the synthesizer, which spoke whatever the screen reader passed to it.  Fast forward a few years: as technology evolved, the synthesizer that had previously been a separate hardware device made its way into software.  As multi-channel sound cards became standard on computers and processing power increased, it became possible to use a software synthesizer instead of a hardware synthesizer.

Despite the synthesizer now living inside the computer as a software component, the premise remained the same for screen readers.  They determine what to speak and then pass that text to the synthesizer, which actually speaks the content.  So, just as it is not “Stephen Hawking’s voice,” it’s not the screen reader’s voice, either.  It is really the synthesizer’s voice.
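If you’d like to hear that division of labor for yourself, below is a rough sketch using the browser’s built-in Web Speech API. This is not what desktop screen readers actually use, but the split is the same: the code supplies the words, and the sound of the voice comes from whichever synthesizer voice is installed and selected.

```typescript
// Rough sketch with the Web Speech API (not what desktop screen readers use):
// the code decides WHAT to say; the selected synthesizer voice decides HOW it sounds.
const utterance = new SpeechSynthesisUtterance("Hello from whichever voice you pick.");

// In a real page you would wait for the "voiceschanged" event before calling getVoices().
const voices = window.speechSynthesis.getVoices();

// The same sentence sounds completely different with each voice, which is exactly
// why there is no single "screen reader voice."
utterance.voice = voices.find((v) => v.lang.startsWith("en")) ?? voices[0] ?? null;

window.speechSynthesis.speak(utterance);
```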

Why are those darn synthesizers so robotic?

Now that you understand that it is really the synthesizer speaking, why do synthesizers sound so robotic, like the one Dr. Hawking uses?  Many people who are blind listen to speech at extremely fast rates, and until recently it wasn’t possible for a synthesizer to sound human and speak that fast.  I’ll give you an example.  Speak the following out loud to yourself in your natural tone: “I enjoy a very nice, warm, and sunny day!”  Now, try speaking the same sentence extremely enthusiastically, as if you were reading a storybook to a child.  With that exact same enthusiasm, speak it as fast as you possibly can.  Most of the time, you will lose the enthusiasm in your voice in exchange for a slightly more robotic tone.  Software synthesizers ran into the same problem: a human-sounding synthesizer could not articulate every syllable at an extremely high rate of speech.  Or, if a synthesizer did manage to solve this problem, it would not be very responsive, which is extremely annoying for a blind computer user.  Imagine pressing the Down Arrow to hear a line of speech and having to wait two seconds for it to be read!  You can experience this for yourself if you are sighted: just read a line of text in this article and then wait two seconds before reading the next line.  Annoying, isn’t it?
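If you’re sighted and curious about what fast synthesized speech is like, here’s a small sketch, again leaning on the browser’s Web Speech API, that speaks the same sentence at different rates. The rate property is relative to the voice’s normal speed, and the exact maximum varies by browser and voice.

```typescript
// Speak the same sentence at different rates so you can hear the trade-off yourself.
function speakAtRate(text: string, rate: number): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = rate; // 1.0 is the voice's normal speaking rate
  window.speechSynthesis.speak(utterance); // calls queue up and play in order
}

speakAtRate("I enjoy a very nice, warm, and sunny day!", 1); // conversational pace
speakAtRate("I enjoy a very nice, warm, and sunny day!", 3); // closer to the rates many experienced screen reader users prefer
```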

To keep a synthesizer that could speak at very fast rates and remain responsive, users tended to stay with a more robotic sound.  In fact, many users become so accustomed to this speech that they stick with it.  For example, you may have heard someone say, “That sounds like the JAWS voice.”  What they really mean is “That sounds like Eloquence.”  Eloquence, of course, is the name of the synthesizer.  In more recent years, consumers have come to expect human-sounding speech, so many assistive technologies now include both robotic-sounding synthesizers and human-sounding synthesizers like NeoSpeech, Vocalizer, and Vocalizer Expressive.  Apple introduced a synthesizer, Alex, that even takes breaths like a real human.  As technology improves, we are seeing a shift that makes human-sounding synthesizers viable alternatives.  Additionally, many people are losing their sight at an older age, and they expect their computer to talk like a real person.

The Pitfalls of Synthesizers for Developers

How does this affect web developers and others who need their applications to be accessible?  One of the mistakes I have seen developers make is trying to make their application so blind-friendly that they spend way too much time on the wrong things. For example, suppose the word “Wednesday” has been abbreviated as “Wed.”  I’ve seen developers spend hours trying to make sure that a screen reader speaks “Wednesday” instead of “Wed.”  Unknown to many developers, most screen readers let their users change the way text is pronounced, so if a user wants to hear “Wed.” as “Wednesday,” they already have the ability.  Furthermore, different synthesizers pronounce words differently.  Even if a developer spent hours getting one synthesizer to speak a word exactly the way they want, a screen reader user may choose a different synthesizer, which defeats the entire purpose of the exercise.  As another example, older synthesizers like DECtalk Access 32 used to speak the word Muncie (a city in Indiana where the prominent Ball State University is located) as “monkey.”  A web developer could try all sorts of coding techniques to change the way this speaks, but the issue is not with the development; it lies with the synthesizer, and the fix belongs in the screen reader.  The burden is thus on the screen reader user, not the developer.
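To make that pitfall concrete, here is a hypothetical sketch of the kind of workaround I’m describing. The “.day-header” selector is invented for the example; the point is the pattern, not the markup.

```typescript
// Hypothetical example of chasing pronunciation in code: overriding the accessible
// name of a calendar header so a screen reader announces "Wednesday" even though
// the cell visibly shows "Wed." (".day-header" is a made-up selector.)
const header = document.querySelector<HTMLElement>(".day-header");
if (header) {
  header.setAttribute("aria-label", "Wednesday");
}
// As explained above, this effort is usually wasted: users can adjust pronunciation
// in their screen reader, and a different synthesizer may say the "fixed" text
// differently anyway.
```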

Pronunciation Burden – Let the User Handle It

I’m constantly trying to teach my two-year-old daughter how to correctly pronounce words, and screen reader users can do the same kind of teaching with their software.  Let’s take a look at how a JAWS user can change how a word is spoken. Below is a screenshot of the Dictionary Manager in JAWS, where users can enter a word and then determine how they want it to be spoken. Note that the Dictionary Manager has entries for multiple languages, and it can also hold separate entries for different synthesizers:

I have a friend named Caleb.  Eloquence, which I use as my default synthesizer for JAWS, speaks Caleb as “cah lehb.”  I can change the way the word is pronounced by adding an entry to the Dictionary Manager.  The first step is to enter the word that should be spoken differently. In this instance, I have typed “Caleb.”

Next, the user inputs a phonetic spelling for how they want it to sound.

I have phonetically spelled “Caleb” as “Kayleb.”  I applied this to all languages and all synthesizers, so after I accept these changes, Eloquence (and any other synthesizer that JAWS uses) will now pronounce “Caleb” correctly every time it reads it.
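For developers who like to see an idea in code, here is a toy sketch of what a pronunciation dictionary does conceptually. It is only an illustration; it is not how JAWS actually implements its Dictionary Manager.

```typescript
// Toy illustration of a pronunciation dictionary: substitute user-defined
// respellings into the text before it is handed to the synthesizer.
const pronunciationDictionary: Record<string, string> = {
  Caleb: "Kayleb", // the phonetic respelling from the example above
};

function applyDictionary(text: string): string {
  // Replace whole words that have a dictionary entry; leave everything else untouched.
  return text.replace(/\b[A-Za-z']+\b/g, (word) => pronunciationDictionary[word] ?? word);
}

console.log(applyDictionary("Caleb enjoys a sunny day.")); // "Kayleb enjoys a sunny day."
```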

Whose Voice Was It, Sheldon?

Now you know whose voice it really is, why you shouldn’t spend your development time worrying about it, and how to change the way a word is pronounced in JAWS.  Most importantly, you can impress your friends the next time you watch The Big Bang Theory by explaining that Stephen Hawking is using a synthesizer. And, in the spirit of Sheldon, you can impress them further by telling them which company manufactured that synthesizer and what year it was created. Do you know the answer?  Tweet us and we’ll let you know if you are correct!

As you have learned, when developing content, it is important to focus on the items that genuinely affect accessibility rather than wasting time trying to fix things the user can easily fix themselves. If you or your organization needs help making your software product accessible, please consider contacting the accessibility experts at Interactive Accessibility.
