Current AI systems, such as Amazon's wireless speaker, the Echo, with its Alexa assistant, are connected to your Amazon data, which they use to generate responses. Alexa consists of groups of cloud-based and local AI systems that work together to perform a single command, or at most two or three at a time*, and only after being activated correctly. (* josh.ai and Google can process multiple commands given at one time.)
The industry hides this complexity from the user, seamlessly providing a simulation of human dialogue. We can use Alexa to examine the difference. The request "Alexa, what's the weather" produces the correct response if the system is set up correctly. As we grow accustomed to asking the same thing each day, in such a nonchalant manner, we forget the uniqueness of this 'weather command.' It is just a single command and a single way to activate the response. Once the command has executed, Alexa goes back to listening for "Alexa" and is no longer focused on the user. This medium is not AI; it is just a big set of deep learning AI systems acting as a single responder and waiting for the next command.
When we break down the example, “Alexa, what’s the weather?”, a complicated process takes place. First, the hardware in Amazon’s Echo contains an AI that listens to and processes audio for the word “Alexa,” from all the sounds around it, the entire time it is on. This listening and processing is the first AI that helps with this question. Let's call it AI #1, the Local Command Word Processor (local speech to command).
Second, the audio containing the spoken words “what’s the weather” is turned into text. This Voice to Text AI uses multitudes of recorded utterances to piece together the sounds into letters and words. In this case, it needed to understand “what’s” as a contraction of “what is.” The Voice to Text AI is AI #2.
Third, another AI that specializes in the meaning of the text is involved. This Natural Language Processing AI turns "what’s the weather" into commands, and we can call it AI #3.
Once the intention of the user is determined, programs use a locally defined zip code to request the weather from an Internet weather source, then organize the results into reply text. This text is, in turn, passed to AI #4, the last AI that helps with this question. This Text to Speech AI turns the reply text into audio, matching it against multitudes of recorded utterances to construct sounds into words, and words into sentences. It also adds a cadence and creates an audio file, which is passed to the Alexa speaker to play.
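The four stages described above can be sketched as a chain of functions. This is only a toy illustration, not how Alexa is actually built: every function name here is hypothetical, plain strings stand in for audio, the weather lookup returns canned text instead of calling a real service, and the "00000" zip code is a placeholder.

```python
from typing import Optional

WAKE_WORD = "alexa"  # hypothetical constant for this sketch


def wake_word_stage(heard: str) -> Optional[str]:
    """AI #1 stand-in: pass on what follows the wake word, or None if it is absent."""
    head, _, rest = heard.partition(",")
    if head.strip().lower() != WAKE_WORD:
        return None
    return rest.strip()


def speech_to_text_stage(audio: str) -> str:
    """AI #2 stand-in: 'transcribe' by normalising text and expanding the contraction."""
    text = audio.lower().replace("’", "'").rstrip("?.! ")
    return text.replace("what's", "what is")


def language_stage(text: str) -> Optional[str]:
    """AI #3 stand-in: map the transcribed text to an intent name."""
    return "get_weather" if "weather" in text else None


def fetch_weather(zip_code: str) -> str:
    """Placeholder for the Internet weather request; returns canned reply text."""
    return f"It is sunny in {zip_code}."


def text_to_speech_stage(reply: str) -> bytes:
    """AI #4 stand-in: 'synthesise' audio by encoding the reply text as bytes."""
    return reply.encode("utf-8")


def handle(heard: str, zip_code: str = "00000") -> Optional[bytes]:
    """Run one utterance through all four stages; None means no response."""
    command_audio = wake_word_stage(heard)
    if command_audio is None:
        return None  # never activated, so nothing downstream runs
    text = speech_to_text_stage(command_audio)
    if language_stage(text) != "get_weather":
        return None  # this sketch only understands the weather command
    return text_to_speech_stage(fetch_weather(zip_code))
```

Calling `handle("Alexa, what's the weather?")` returns the encoded reply, while the same question without the wake word returns `None`, which mirrors the point that nothing happens unless the system is activated correctly.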
With four AI systems working together, the complexity involved in delivering a spoken audio response from Alexa is enormous. At this point, the very solution these AI systems provide is becoming a problem for the AI industry as a whole: how to teach users to manage an AI system.
Making the system so easy to use that there is no thought involved is premature. It is still critical that we think about what we want from our AI systems before speaking to one if we want to ensure an accurate response.