Is Google Duplex Too Human?
Exploring Opaque Conversational Agents From a User Perspective
The voices of virtual assistants play a large role in defining human-computer relationships. This thesis project explores user social responses to different types of virtual assistant voices through an experimental design and a survey.
Synthetic speech that can mimic human language has been a long time coming, but it has arrived with Google Duplex. HCI researchers need to catch up with ethical design standards to deal with this new technology.
My findings, drawn from an experimental survey, advise researchers and designers to follow a set of ethical guidelines when considering the use of humanlike synthetic voices.
In this thesis project, I help address the dearth of literature on voice assistant HCI by conducting primary research. Tools: MTurk, Qualtrics dynamic survey design, statistical analysis, audio production, market analysis, topic modeling.
Background \\ Previous Work \\ Research Questions
On May 9, 2018, Google released a new virtual assistant feature called Google Duplex. Unlike the typical bot voice we have come to expect from Google Assistant and most other products of its kind, Duplex surprised Google I/O conference attendees with a convincingly human voice. Nothing in the voice revealed that it belonged to a machine built on natural language processing and AI components. This new display of technology brings us back to the many questions we ask about AI. Do humans really want technology that mimics human conversation? And could this erode our trust in what we hear?
Alarm bells rang in the field of Human-Computer Interaction (HCI), and ethicists quickly voiced concerns that Google Duplex failed to provide disclaimers: it was tricking people into believing they were talking with another human. Google reacted by adding a disclaimer to Duplex interactions. The company later also released a statement that the product would only be used by individuals and businesses who had signed up.
It is concerning that Google created this new technology without any clear ethical principles guiding the research and design of the product.
Limited research in speech synthesis indicates that the actual voices of virtual assistants (e.g., Amazon Alexa) are important to study. There are not only ethical implications of how voices change our perceptions of who or what we are talking to, but there are user testing implications as well. Overall, the voices of virtual assistants play a large role in defining human-computer relationships.
This thesis project measures whether humans are tricked by machine voices, and whether Google Duplex elicits uncanny feelings by being “almost but not quite” human. Perhaps more importantly, I invite participants to share their thoughts on AI/virtual assistant ethics in an open-ended format to generate ideas for a guideline.
Bear with me while we take a short dive into some literature that provides a backbone to this project. I swear that it is worth your time, and you may learn something about human psychology and why HAL from 2001: A Space Odyssey scared us so much.
This study rests on a theoretical grounding of the Uncanny Valley of the Mind (Mori, 1970), a series of speech HCI findings compiled by Clifford Nass and Scott Brave (2005), Knapp’s Relationship Escalation Model (1978), and the Diffusion of Innovations Model (1962). Here, we’ll discuss the first two, but you can read my thesis if you want to hear it all.
Clifford Nass and Scott Brave’s book, Wired for Speech, synthesizes and conceptually expands upon findings from numerous speech and voice studies (2005). Their writings provided foundational literature for this project in areas of (1) understanding human evolution relating to sound, (2) understanding a user’s ability to discern between recorded, synthetic, and human voices based on complex cues, and (3) understanding that voices, whether human or not, elicit social behaviors from human users.
The book was published at a time when the technology for generating nonhuman voices was rudimentary: helplines used scripted human recordings or computer-generated voices that relied on a simple rule-based structure. Computer-generated voices required human notation to attempt to breathe life into them. Overall, computer-generated applications focused on transactional conversations, such as helplines for banking, checking airline reservations, ordering stocks, and navigating the web. Forays into relational conversations (such as with Eliza the psychologist chat assistant or Ananova the virtual newscaster) were experimental and clunky (Nass and Brave, 2005, p. 135).
While Nass and Brave conjectured that voice technologies would continue to improve, they could not imagine the rapid progress that would be introduced by the third wave of artificial intelligence, made possible by access to more data and GPUs to process that data. This third wave began around 2010 and is characterized by applying neural networks to cognitive processes such as vision and speech, with the goal of creating models that are capable of a higher order of understanding that approaches human cognition. Neural network approaches of this wave are often referred to as deep learning because they stack multiple layers of pattern recognition on top of each other to reach higher order interpretations.
As a result of the third wave of artificial intelligence, we have advanced systems to process and understand speech, as well as to produce speech in real-time conversations. We can now conduct research on voice technologies that Nass and Brave did not anticipate— such as a voice assistant that can quite accurately mimic human conversations.
We also learn from Nass and Brave that psychologists and designers alike have used their understanding of how the brain processes voices to predict cues that will encourage or discourage perceptions of humanness (p. 134). This is undoubtedly the case today in developing products such as Alexa, Siri, Google Assistant, and now Google Duplex. Many voice cues suggest that a person is listening to a nonhuman, such as “pauses at inappropriate moments, emphasis on the wrong syllable or wrong word, rising and declining pitch at the wrong times, mispronunciation of words that humans generally pronounce correctly, and so on” (Nass and Brave, 2005, p. 134). But what happens when we pick up from these complex cues that the voice we are listening to might not be human?
In 1970, Masahiro Mori hypothesized that there is a relationship between the degree of an object's resemblance to a human being and our emotional responses to that object. The concept of the uncanny valley, which has since been verified in numerous studies, suggests that human-like objects which appear almost, but not exactly, like real human beings elicit an uncanny reaction. This reaction includes feelings of unfamiliarity, eeriness, and even revulsion in the user. The effect has primarily been studied with a combination of visuals and audio, such as with 3D renderings, robotics, and human-like dolls.
It turns out that no study has been done that specifically focuses on measuring the uncanny valley for an audio-only experience, which means we simply don’t know what emotional reaction users might have to a product like Google Duplex! This project begins to answer that question.
RQ1: Can end users reliably distinguish between a human voice and today’s advanced machine voices?
RQ2: What degree of realism do end users prefer when using a virtual assistant with a synthetic voice?
RQ3: What is the emotional response of end users when listening to virtual assistants with synthetic voices?
RQ4: Is this ethical? What do participants think? What are their concerns?
GOAL: Create a recommendation to researchers and designers on ethical creation of virtual assistants that meets user needs.
Preparation \\ Survey Experience
The research instrument for this study is a survey, posted as a Human Intelligence Task (HIT) on Amazon’s Mechanical Turk platform to recruit subjects (N = 405 valid responses). Survey questions are randomized and include internal and external validity checks.
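As a rough illustration of per-participant randomization, the sketch below shuffles a list of survey items with a seeded generator so that each participant's order is reproducible. Qualtrics handled randomization in the actual survey; the function name, the seed scheme, and the item labels here are hypothetical.

```python
import random


def randomized_order(items, participant_id, study_seed=0):
    """Return a reproducible shuffled copy of items for one participant.

    Hypothetical helper: the seed string combines a study-level seed with
    the participant ID so the same participant always gets the same order.
    """
    rng = random.Random(f"{study_seed}:{participant_id}")
    order = list(items)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


# Placeholder labels, not the study's actual item names
clips = ["clip_a", "clip_b", "clip_c"]
first_order = randomized_order(clips, participant_id="P001")
```

Seeding per participant makes it possible to reconstruct afterwards which order each respondent saw, which helps when checking for order effects.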
Tools: Audio Production, Dynamic Survey Design on Qualtrics \\ MTurk \\ Python and SPSS for Analytics
I used the original script from a Google Duplex conversation and duplicated it with a human voice actor and with Google Assistant (using IFTTT).
I created a dynamic survey that included randomization and validity checks. This survey went through 3 rounds of pre-tests.
I received funding from the Dallas Morning News Innovation Fellowship to set up and compensate MTurk workers.
Participant Survey Experience
MTurk participants begin a 10-15 minute survey.
Participants turn on their audio to listen to three recordings of human or machine voices in conversation, following an identical script.
While listening to these audio clips, participants fill out the Ho and MacDorman Questionnaire, designed to measure uncanny effects.
Then, participants rank audio by preference, rate the audio for any ‘uncomfortableness,’ and guess whether the speakers were human or machine.
Participants receive a debriefing in which they are truthfully informed about whether the voices in the audio were human/machine.
‘Primed’ participants discuss voice assistant ethics through hypothetical scenarios, agree/disagree statements, and open-ended questions.
Participants answer questions on technology adoption and demographics. Finally, participants are compensated.
RQ1 \\ RQ2 \\ RQ3 \\ RQ4
of users believed that Google Duplex
was a human voice, not a machine
End users cannot reliably distinguish between a human voice and an advanced machine voice such as Google Duplex. Users did reliably discern that the Google Assistant voice was synthetic and that the human voice was a genuine human. This simple measure illustrates that the technology has in fact advanced to a point where synthetic voices can trick the mind.
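One way to check whether guesses differ from chance is an exact two-sided binomial test. The sketch below is illustrative only: the text does not state which test was used in the analysis, the 50% chance level is an assumption, and the counts are made up rather than taken from the study.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of every outcome that is no more likely than
    the observed one under the null hypothesis of success probability p.
    """
    def pmf(k):
        return comb(trials, k) * p**k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance guards against floating-point ties
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))


# Made-up counts for illustration, NOT the study's data
p_at_chance = binomial_two_sided_p(successes=5, trials=10)
p_all_correct = binomial_two_sided_p(successes=10, trials=10)
```

A large p-value means the guesses are consistent with coin-flipping, while a small one suggests listeners could tell the voices apart (or were systematically fooled).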
1st place goes to…
the human voice (phew).
But users ranked the Google Duplex
voice nearly as highly.
Weighted averages were used to calculate the rank order for the three audio choices. Before and after the debriefing, the order of preference remained the same, with the human voice being the most preferred. The Google Duplex voice came in second and the Google Assistant voice third. Notably, after the debriefing the Duplex voice fell slightly in popularity.
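A weighted-average ranking can be computed as sketched below. The weighting scheme (3 points for first place, 2 for second, 1 for third) is an assumption, since the exact weights are not given here, and the counts are invented for illustration.

```python
def weighted_rank_score(rank_counts):
    """Average score for one option, where rank 1 earns the highest weight.

    rank_counts[i] is the number of participants who placed the option
    at rank i + 1. With three options the assumed weights are 3, 2, 1.
    """
    n_ranks = len(rank_counts)
    total = sum(rank_counts)
    if total == 0:
        raise ValueError("no responses to score")
    weighted = sum(count * (n_ranks - i) for i, count in enumerate(rank_counts))
    return weighted / total


# Invented counts for ten imaginary participants, NOT the study's data
example_counts = {
    "human": [6, 3, 1],
    "duplex": [3, 5, 2],
    "assistant": [1, 2, 7],
}
scores = {name: weighted_rank_score(counts) for name, counts in example_counts.items()}
ordering = sorted(scores, key=scores.get, reverse=True)
```

Running the same calculation on pre-debriefing and post-debriefing rankings gives two sets of scores that can be compared directly.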
Uncanny or warm and fuzzy?
Users didn’t experience uncanny feelings from Google Duplex. They found it as normal, humanlike, and attractive as a human voice.
This is a voice technology that traverses the uncanny valley. We likely won’t be feeling any of the spine-tingling emotions that HAL made us feel in 2001: A Space Odyssey.
Is this… ethical?
The answer is it all depends. 405 participants had
vastly different opinions on this matter.
What happens if the risks associated with interacting with a voice assistant increase? To what extent are we willing to put trust in a technology in life-or-death situations? This is addressed through a series of hypothetical scenarios involving booking a table, obtaining news information, online dating, psychological counseling, filling a medical prescription, and obtaining help in an emergency.
Users are more comfortable with the idea of using human-like voice assistants in low-risk situations such as getting the news or booking a table online. As risk increases to matters such as emergency response, psychological counseling, and filling a medical prescription, users find this idea uncomfortable.
The scenario of online dating is fascinating: users are more uncomfortable with this hypothetical scenario than even with relying on technology to properly fill a medical prescription. Why is this? I suspect that filling a prescription or calling 911 tends to be a transactional type of conversation that largely involves exchanging facts. Meanwhile, online dating is relational and emotional. The Uncanny Valley theory suggests that, as encounters with machines approach intimate or emotional relationships, users may become increasingly wary. It may be easier to imagine that a voice assistant, using advanced processing to communicate and make decisions, might be able to provide a more efficient emergency hotline than 911 currently provides.
There was no general consensus among participants on whether products should include disclaimers, whether ethical guidelines should be context specific or blanket statements, or whether participants would like voice assistants to sound human. Shown below, it is clear that there was a wide range of answers.
The first open-ended question asked participants if they had specific concerns about voice assistants that could mimic human speech. The biggest concern raised was on the topic of data privacy and crime. Participants are concerned about crimes such as scams, phishing, and identity fraud. There were also concerns about how it could increase the amount of spam, telemarketing, and ads users are exposed to. About a quarter of the participants did not think there were any particular issues that could come from this technology, apart from the loss of jobs. Another quarter stated that they would choose to not interact with such a technology at all. A few participants stated that there should be clear disclaimers from voice assistants, but that they were interested in seeing the industry progress.
The second open-ended question asked participants to list ideas they might have for controlling or lowering risks of this technology. This question yielded the most interesting and diverse set of answers in the entirety of the survey. Answers included specific steps, such as disclaimers, government intervention, and holding companies to guidelines and rigorous testing. There were also a large number of responses that voiced concerns that there was no way to lower risk and that we might face a future in which AI becomes more intelligent than humans.
A key insight was that multiple participants stated that they wouldn’t have been aware of these potential problems had they not experienced being “tricked” by Google Duplex. Companies can do a great deal through design and PR messaging to deflect ethical concerns from the public, but when users find out that the voice they trusted was a machine, they feel disconcerted.
A GUIDE TO ETHICAL VOICE ASSISTANTS
Based on the results from my research, I suggest the following guidelines. Remember that my experiment is limited. I had 405 participants. I analyzed reactions to 1 gender (male) and explored 1 specific conversation script (booking a restaurant table). If you’d like to collaborate to build on this research, contact me.