Why is this the case? For many systems, the local hardware listens for this attention command by processing all the audio it hears, sorting through sounds, looking for the "Attention Command." Once it recognizes the command, the system switches to cloud mode, sending your audio to the cloud to be processed. Herein lies the rub: for privacy, we don't have a cloud system listening to our every conversation; only after you get the attention of the system is that audio sent.
Given that the "Attention Command" is the only way a voice can activate the system, using it requires prior knowledge of the particular "Attention Command" and of the fact that the word itself is a command to start listening. A user who doesn't know this might speak the first few words of a sentence before the system has started listening. That's why the "Attention Command" followed by a pause is more effective.
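The gating behavior described above can be sketched in a few lines. This is a toy illustration, not Amazon's implementation: the audio "frames" are stand-in strings, and `detect_wake_word` is a hypothetical placeholder for the trained on-device detector.

```python
# A minimal sketch of wake-word gating, assuming audio arrives as a
# stream of short frames. Frames and detect_wake_word are hypothetical
# stand-ins for the real on-device detector.

WAKE_WORD = "alexa"

def detect_wake_word(frame: str) -> bool:
    """Stand-in for the local detector: runs on every frame, on-device."""
    return WAKE_WORD in frame.lower()

def gate_stream(frames):
    """Yield only the frames captured after the wake word is heard.

    Nothing before the wake word ever leaves the device; that is the
    privacy boundary described above.
    """
    awake = False
    for frame in frames:
        if not awake:
            awake = detect_wake_word(frame)  # local processing only
        else:
            yield frame                      # would be sent to the cloud

# Words spoken before "Alexa" are discarded locally:
stream = ["turn on", "the lights", "Alexa", "what's", "the weather"]
print(list(gate_stream(stream)))  # only the two frames after the wake word
```

Note that this also shows why the pause matters: any words spoken in the same frame as the wake word, or before it, never reach the cloud side at all.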
The prevailing idea in the industry is to hide this complexity from the user, seamlessly simulating human dialogue. Example: "Alexa, what's the weather?" produces the correct response if the system is set up correctly. As we grow accustomed to asking the same thing each day, in such a nonchalant manner, we forget the uniqueness of this "weather command." It's just a single series of words that activates the request. But after the command has executed, Alexa is back to listening for "Alexa," no longer focused on you. The reason is that this AI is not 'an AI'; it is many AI systems acting as a single AI, a big set of systems waiting for the next command. When we break down our example, "Alexa, what's the weather?" we can see that this simple question is a complicated process.
First, the hardware in Amazon's Echo contains an AI that listens to and processes audio the entire time it is on, picking the word "Alexa" out of all the sounds around it. This Local Command Word Processor (local speech to command) is AI #1.
Second, the audio containing the spoken words "what's the weather" is turned into text. This Voice to Text AI uses multitudes of recorded utterances to piece the sounds together into letters and words. In this case, it needed to recognize the contraction of "what is" as "what's." Voice to Text AI is AI #2.
Third, Natural Language Processing, an AI that specializes in the meaning of text, turns "what's the weather?" into commands the system can act on. Natural Language Processing is AI #3.
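As a rough sketch of what "text into commands" means, the snippet below maps transcribed text to a named intent. The intent names and keyword patterns are made up for illustration; production NLP uses trained models, not keyword tables.

```python
# Toy illustration of the NLP step: map transcribed text to an intent.
# INTENT_PATTERNS and the intent names are hypothetical examples.
import re

INTENT_PATTERNS = {
    "GetWeather": re.compile(r"\bweather\b", re.IGNORECASE),
    "SetTimer":   re.compile(r"\btimer\b", re.IGNORECASE),
}

def parse_intent(text: str) -> str:
    """Return the first intent whose pattern matches the text."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return "Unknown"

print(parse_intent("what's the weather"))  # GetWeather
```

The point of this stage is the same regardless of technique: free-form words in, a machine-executable command out.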
Once the user's intention is determined, ordinary programs take a locally defined zip code, request the weather for that location from an internet weather source, and organize the results into reply text. This text is, in turn, passed to AI #4, Text to Speech, which renders the reply text as audio, matching multitudes of recorded utterances to construct the sounds into words and the words into sentences, adding a cadence, and producing an audio file that is then passed to the Alexa speaker to play.
With four AI systems working together, the complexity of what is involved in delivering a spoken audio response from Alexa is enormous.
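The four stages above can be chained as one toy pipeline. Every stage here is a hypothetical stub: real wake-word detection, speech recognition, and speech synthesis are trained models, the weather lookup would be a network call, and the zip code 98109 is an invented example.

```python
# Toy end-to-end sketch of the four-AI pipeline. All stages are stubs.

def ai1_wake_word(frames):
    """AI #1: local command-word processor; gates the audio stream."""
    heard = False
    for frame in frames:
        if heard:
            yield frame
        elif "alexa" in frame.lower():
            heard = True

def ai2_speech_to_text(frames):
    """AI #2: voice to text; here the 'audio' frames are already words."""
    return " ".join(frames)

def ai3_nlp(text):
    """AI #3: natural language processing; text to a command."""
    return "GetWeather" if "weather" in text.lower() else "Unknown"

def fulfill(command, zip_code="98109"):
    """Not an AI: ordinary code that looks up data for the command."""
    if command == "GetWeather":
        return f"Today in {zip_code}: cloudy, 61 degrees."
    return "Sorry, I don't know how to help with that."

def ai4_text_to_speech(text):
    """AI #4: text to speech; stands in for audio synthesis."""
    return f"<audio>{text}</audio>"

frames = ["Alexa", "what's", "the", "weather"]
reply = ai4_text_to_speech(
    fulfill(ai3_nlp(ai2_speech_to_text(ai1_wake_word(frames))))
)
print(reply)
```

Each function boundary in this sketch marks a hand-off between separate systems, which is exactly why the whole thing only feels like "an AI" from the outside.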
Teaching users how to use an AI system is a problem for the entire AI industry, because we don't address each other by first calling out a name and waiting for an acknowledgment before speaking a command. When we want something from someone, we say it all together. For example, 'Al, turn on the lights' comes out in one breath, without a pause. It's not natural for humans to pause after calling someone's name, especially when we're feeling physically comfortable. That is the industry problem I spoke of: these AIs exist in the environments where we feel comfortable, so they must adapt to us for the best adoption and the lowest attrition rate. The AI systems are, in part, built for human comfort, so the nuance of inserting a simple pause stands in the way of faster adoption. Chances are, if you know what you want and can see the sentence in your mind before saying it, then call out the AI start command, say 'Alexa,' wait for the system to acknowledge your request for attention, and then speak the sentence. Then, and only then, you almost always get what you want, and quickly.
Currently, a user must think about what information they want from an AI system before speaking to it, to ensure an accurate response. Understanding how these systems work, and how we communicate with each other, is an important first step toward full adoption.
For example, understanding command word requirements is essential when using an AI platform with voice. Usually the local hardware listens for the acknowledgment command, processing all the audio it hears, sorting through sounds, looking for the Attention Command. Because the attention command is the only way to activate the system by voice, it requires prior knowledge of the command, and of the fact that the word itself is a command to start listening. If you don't know that, you might speak the first few words of your sentence before the system has started listening. That's why pausing after the attention command is best. We see bridging this gap as a paramount need for our industry.