To be, or not to be: Should voice interfaces (VUIs) try to be human-like?

Today’s speech and voice interfaces strive to imitate human conversation. But is this always the best design choice?

Suchithra Sathiyamurthy
UX Collective


[Image: A mask that is part robot and part human on a background of binary numbers.]

While Graphical User Interfaces (GUIs) have supported humans in communication and in executing a wide range of tasks for around four decades now, users are increasingly turning to speech interfaces, or Voice User Interfaces (VUIs). Human dialogue has been the most common metaphor for understanding and designing such voice-based interactions. Commercial players such as Microsoft, Google, and a slew of others have focused on developing more realistic, human-like voice technology in the hope of increasing user acceptance. While this is enticing in terms of innovation, there are concerns about this design choice (Aylett et al., 2019; Moore, 2017a).

The expectation gap

Similar to the concept of mental models in GUIs, Cowan et al. (2015) remark that users interacting with VUIs hold assumptions and expectations about their capabilities, which they refer to as partner models. Previous research on speech interfaces has found that a mismatch between user expectations, formed from perceptual cues, and an interface’s actual abilities can trigger the verbal ‘uncanny valley’ effect: a feeling of unease and eeriness toward non-human entities that approach human-likeness but fall short of expected human characteristics (Clark et al., 2014; Clark et al., 2021; Moore, 2012; Mori et al., 2012). This sense of discomfort can degrade the user experience and discourage further interactions with these interfaces (Clark, 2018; Clark et al., 2021). While GUIs communicate their range of functions through visual screens, voice-only interfaces present such cues only through their speech. In these cases, users’ inability to grasp the capacities of “conversational” agents leads to task abandonment, particularly when there are functional flaws and continuing the task would cause social embarrassment (Luger & Sellen, 2016).

Imitation vs Inspiration

While using human speech as a model to improve user experience may be beneficial, it does not have to be the end goal (Aylett et al., 2019). Although Nass & Brave (2005) claim that humans relate to VUIs in the same manner they relate to other humans, more recent research shows that users view these agents as unequal partners, framed around a master-servant relationship (Doyle et al., 2019). There is a fundamental limit to how closely dialogue with such interfaces can resemble human-to-human communication, necessitating alternative approaches to designing these interactions (Moore, 2017b).

Users’ perceptions of human likeness in verbal communication are multidimensional, and each of these dimensions must be addressed differently in VUIs (Doyle et al., 2019). However, bestowing only certain traits of humanness can lead users to believe that all other human traits are present as well, implying an all-or-nothing approach when emulating humanness (Moore, 2017b). Users are also less tolerant of the communication limits of Intelligent Personal Assistants (IPAs) than of those of humans, which can decrease user engagement (Doyle et al., 2019). As a result, it is critical to create congruent voice-behaviour pairs in non-embodied voice agents and congruent visual-voice-behaviour pairs in embodied ones.

As Aylett et al. (2019) point out, users experience VUIs in a variety of ways, and humanness can therefore serve as one of many sources of design inspiration. Examples such as the voice of the KFC robot H.A.R.L.A.N.D (Human Assisted Robotic Linguistic Animatronic Networked Device) illustrate how synthetic voices can take cues from human voices without strictly imitating them, while still sounding engaging, natural, and less eerie (Aylett et al., 2019). Research also shows that users highlight the combination of audio from IPAs with visual feedback as a source of human likeness, comparing the latter to the feedback humans provide through facial expressions and hand gestures (Doyle et al., 2019). Future research could shift the focus away from ‘humanness’ in human conversations and instead draw inspiration from communication between mismatched partners, such as native and non-native speakers, adults and children, or even humans and dogs (Moore, 2017b).

Purpose, Context & Situation

On the other hand, natural voices have been shown to impose a lower cognitive load on users than synthetic voices (Francis & Nusbaum, 2009; Simantiraki, Cooke, & King, 2018). This encourages the use of more human-like voices in lengthier conversations (Torre & White, 2021). Areas like healthcare and elderly care, where fostering social and meaningful relationships is critical, may warrant more human-like abilities in VUIs (Bickmore et al., 2018; Sabelli et al., 2011). And although users are less interested in forming relationships with IPAs, studies of buying behaviour through these agents show heightened perceptions of control, focused attention, and exploratory behaviour during encounters with humanized voice assistants (Clark et al., 2019b; Poushneh, 2021).

According to Suchman (1987), human communication rests on a shared understanding of context and situation, and establishing this common ground between machines and humans can be difficult. Popular voice assistants, such as the Amazon Echo, are mostly used at home in family settings and struggle with social rules such as situational awareness, turn-taking, and detecting and responding to multiple voices (Luria et al., 2017; Porcheron et al., 2018; Pyae & Joelsson, 2018; ter Maat et al., 2010). Understanding social norms is a key component of human discourse, and failures in this area widen the gap between user expectations and agent capabilities.
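To make the difficulty concrete, here is a minimal, purely illustrative Python sketch of a silence-based turn-taking rule; the threshold value and action names are assumptions made for this essay, not how any commercial assistant actually works:

```python
from dataclasses import dataclass

# Illustrative sketch of silence-based turn-taking; the threshold and the
# action names are invented for this essay, not taken from a real assistant.

@dataclass
class TurnState:
    user_speaking: bool = False   # have we heard the user this turn?
    silence_ms: int = 0           # how long the user has been quiet

END_OF_TURN_SILENCE_MS = 700      # assumed pause length that "ends" a turn

def on_audio_frame(state: TurnState, voice_detected: bool, frame_ms: int = 20) -> str:
    """Decide the agent's action for one short audio frame."""
    if voice_detected:
        # Treat any detected speech as the user's turn. This is a big
        # simplification: it cannot tell which household voice is talking.
        state.user_speaking = True
        state.silence_ms = 0
        return "listen"
    if state.user_speaking:
        state.silence_ms += frame_ms
        if state.silence_ms >= END_OF_TURN_SILENCE_MS:
            # Guess that the user has finished. A thoughtful mid-sentence
            # pause longer than the threshold wrongly triggers a response.
            state.user_speaking = False
            return "take_turn"
    return "wait"
```

A fixed silence threshold cannot distinguish a thoughtful pause from a finished turn, and it says nothing about which of several voices in a room is speaking; these are precisely the breakdowns observed in family settings (Porcheron et al., 2018).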

Predictive algorithms in apps like Instagram and Amazon adapt to a user’s time, location, mood, identity, and a variety of other factors to deliver highly context-aware content. Research on voice agents with such capabilities is ongoing, and it may lead to artificial voices that are flexible enough to adjust their humanness in response to the user’s state and environment (Microsoft, n.d.).
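As a thought experiment, the sketch below expresses ‘adjustable humanness’ as an explicit function of context. Every domain name and parameter value here is a hypothetical stand-in chosen to echo the research cited above, not a published or shipping design:

```python
from typing import TypedDict

# Hypothetical sketch: the domains, scores, and parameters are assumptions
# made for illustration only.

class VoiceStyle(TypedDict):
    humanness: float       # 0.0 = overtly synthetic, 1.0 = highly human-like
    speaking_rate: float   # 1.0 = default text-to-speech rate
    use_fillers: bool      # conversational fillers such as "hmm", "let's see"

def style_for_context(domain: str, expected_turns: int) -> VoiceStyle:
    # Care settings may warrant more social, human-like behaviour
    # (Bickmore et al., 2018; Sabelli et al., 2011).
    if domain in ("healthcare", "elderly_care"):
        return VoiceStyle(humanness=0.9, speaking_rate=0.9, use_fillers=True)
    # Quick command-style exchanges ("set a timer") need little humanness.
    if expected_turns <= 2:
        return VoiceStyle(humanness=0.4, speaking_rate=1.1, use_fillers=False)
    # Longer conversations favour more human-like voices (Torre & White, 2021).
    return VoiceStyle(humanness=0.7, speaking_rate=1.0, use_fillers=True)

print(style_for_context("healthcare", expected_turns=10))
```

The point is not these particular numbers but the shape of the design: humanness stops being a fixed brand decision and becomes one more context-dependent output.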

Current perception of VUIs

Human-computer speech interaction has come closer than any other form of agent interaction to matching communication in the real world, and the inclination is to equate it with human conversation. However, we cannot ignore the fact that, in human-machine interaction, the use of GUIs precedes that of VUIs, and this shapes how VUIs are regarded. Although GUI design seeks to elicit emotion in users, the interface serves as a mediator in this case, and the relationship is one of utility (Sharp et al., 2019). Given that VUIs currently execute fewer operations than GUIs, expectations of their capabilities will be limited as well. Doyle et al. (2019) note that users link IPAs’ identity with commerciality, which in turn makes them difficult to accept in a social capacity. These are potential roadblocks to fully adopting human-like characteristics in VUIs, so designing and evaluating them with adapted versions of standard usability heuristics could also be relevant and appropriate (Murad et al., 2018).

Well, what is the way forward?

With the expanding capacities and pervasiveness of technology, the performance and familiarity of VUIs may evolve, and as a result, what is considered machine-like and what is expected of these interfaces may change over time (Clark et al., 2021). Until then, we must solve several usability issues that have proven to be significant roadblocks in our efforts to create positive user experiences (Budiu & Laubheimer, 2018; Pyae & Joelsson, 2018). While adopting human speech as a model can help with some design challenges, it should not be used as a one-size-fits-all solution. We need to assess the role of these speech agents in our lives and make design choices that are informed by context and purpose.

This essay was originally written as a part of my coursework for MSc Human Computer Interaction at University College Dublin.

Follow me on Instagram @suchithra.ux

References

Aylett, M. P., Cowan, B. R., & Clark, L. (2019). Siri, Echo and Performance: You Have to Suffer Darling. Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 1–10. https://doi.org/10.1145/3290607.3310422

Bickmore, T. W., Trinh, H., Olafsson, S., O’Leary, T. K., Asadi, R., Rickles, N. M., & Cruz, R. (2018). Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Journal of Medical Internet Research, 20(9), e11510. https://doi.org/10.2196/11510

Budiu, R., & Laubheimer, P. (2018, July 22). Intelligent Assistants Have Poor Usability: A User Study of Alexa, Google Assistant, and Siri. Nielsen Norman Group. https://www.nngroup.com/articles/intelligent-assistant-usability/

Clark, L. M. H., Bachour, K., Ofemile, A., Adolphs, S., & Rodden, T. (2014). Potential of Imprecision: Exploring Vague Language in Agent Instructors. Proceedings of the Second International Conference on Human-Agent Interaction, 339–344. https://doi.org/10.1145/2658861.2658895

Clark, L. (2018). Social boundaries of appropriate speech in HCI: A politeness perspective. In Bond, R. (ed.). Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018). https://doi.org/10.14236/ewic/HCI2018.76

Clark, L., Doyle, P., Garaialde, D., Gilmartin, E., Schlögl, S., Edlund, J., Aylett, M., Cabral, J., Munteanu, C., Edwards, J., & Cowan, B. R. (2019a). The State of Speech in HCI: Trends, Themes and Challenges. Interacting with Computers, 31(4), 349–371. https://doi.org/10.1093/iwc/iwz016

Clark, L., Pantidi, N., Cooney, O., Doyle, P., Garaialde, D., Edwards, J., Spillane, B., Gilmartin, E., Murad, C., Munteanu, C., Wade, V., & Cowan, B. R. (2019b). What Makes a Good Conversation? Challenges in Designing Truly Conversational Agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, (pp. 1–12). Association for Computing Machinery. https://doi.org/10.1145/3290605.3300705

Clark, L., Ofemile, A., & Cowan, B. R. (2021). Exploring Verbal Uncanny Valley Effects with Vague Language in Computer Speech. In B. Weiss, J. Trouvain, M. Barkat-Defradas, & J. J. Ohala (Eds.), Voice Attractiveness: Studies on Sexy, Likable, and Charismatic Speakers (pp. 317–330). Springer Singapore. https://doi.org/10.1007/978-981-15-6627-1_17

Cowan, B. R., Branigan, H. P., Obregón, M., Bugis, E., & Beale, R. (2015). Voice anthropomorphism, interlocutor modelling and alignment effects on syntactic choices in human-computer dialogue. International Journal of Human-Computer Studies, 83, 27–42. https://doi.org/10.1016/j.ijhcs.2015.05.008

Doyle, P. R., Edwards, J., Dumbleton, O., Clark, L., & Cowan, B. R. (2019). Mapping Perceptions of Humanness in Intelligent Personal Assistant Interaction. Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services. https://doi.org/10.1145/3338286.3340116

Francis, A. L., & Nusbaum, H. C. (2009). Effects of intelligibility on working memory demand for speech perception. Attention, Perception, & Psychophysics, 71(6), 1360–1374.

Luger, E., & Sellen, A. (2016). “Like Having a Really Bad PA”: The Gulf between User Expectation and Experience of Conversational Agents. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 5286–5297. https://doi.org/10.1145/2858036.2858288

Luria, M., Hoffman, G., & Zuckerman, O. (2017). Comparing Social Robot, Screen and Voice Interfaces for Smart-Home Control. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 580–628. https://doi.org/10.1145/3025453.3025786

Microsoft. (n.d.). Situated Interaction. https://www.microsoft.com/en-us/research/project/situated-interaction/

Moore, R. K. (2012). A Bayesian explanation of the ‘Uncanny Valley’ effect and related psychological phenomena. Scientific Reports, 2(1), 864. https://doi.org/10.1038/srep00864

Moore, R. K. (2017a). Appropriate Voices for Artefacts: Some Key Insights. In 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots.

Moore, R. K. (2017b). Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction. In K. Jokinen & G. Wilcock (Eds.), Dialogues with Social Robots: Enablements, Analyses, and Evaluation (pp. 281–291). Springer Singapore. https://doi.org/10.1007/978-981-10-2585-3_22

Mori, M., MacDorman, K. F., & Kageki, N. (2012). The Uncanny Valley [From the Field]. IEEE Robotics Automation Magazine, 19(2), 98–100. https://doi.org/10.1109/MRA.2012.2192811

Murad, C., Munteanu, C., Clark, L., & Cowan, B. R. (2018). Design Guidelines for Hands-Free Speech Interaction. Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, 269–276. https://doi.org/10.1145/3236112.3236149

Nass, C., & Brave, S. (2005). Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. The MIT Press.

Priest, D. (2020, March 12). Alexa and Google Assistant are developing personalities. CNET. https://www.cnet.com/home/smart-home/alexa-and-google-assistant-are-developing-personalities/

Porcheron, M., Fischer, J. E., Reeves, S., & Sharples, S. (2018). Voice Interfaces in Everyday Life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1–12). Association for Computing Machinery. https://doi.org/10.1145/3173574.3174214

Poushneh, A. (2021). Humanizing voice assistant: The impact of voice assistant personality on consumers’ attitudes and behaviors. Journal of Retailing and Consumer Services, 58, 102283. https://doi.org/10.1016/j.jretconser.2020.102283

Pyae, A., & Joelsson, T. N. (2018). Investigating the Usability and User Experiences of Voice User Interface: A Case of Google Home Smart Speaker. Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, 127–131. https://doi.org/10.1145/3236112.3236130

Sabelli, A. M., Kanda, T., & Hagita, N. (2011). A Conversational Robot in an Elderly Care Center: An Ethnographic Study. Proceedings of the 6th International Conference on Human-Robot Interaction, 37–44. https://doi.org/10.1145/1957656.1957669

Sharp, H., Rogers, Y., & Preece, J. (2019). Interaction Design: Beyond Human-Computer Interaction (5th edition). Wiley.

Simantiraki, O., Cooke, M., & King, S. (2018). Impact of Different Speech Types on Listening Effort. Proc. Interspeech 2018, 2267–2271. https://doi.org/10.21437/Interspeech.2018-1358

Suchman, L. A. (1987). Plans and situated actions: The problem of human-machine communication. Cambridge University Press.

ter Maat, M., Truong, K. P., & Heylen, D. (2010). How Turn-Taking Strategies Influence Users’ Impressions of an Agent. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, & A. Safonova (Eds.), Intelligent Virtual Agents (pp. 441–453). Springer Berlin Heidelberg.

Theodoridou, A., Rowe, A. C., Penton-Voak, I. S., & Rogers, P. J. (2009). Oxytocin and social perception: oxytocin increases perceived facial trustworthiness and attractiveness. Hormones and behavior, 56(1), 128–132. https://doi.org/10.1016/j.yhbeh.2009.03.019

Torre, I., & White, L. (2021). Trust in Vocal Human-Robot Interaction: Implications for Robot Voice Design. In B. Weiss, J. Trouvain, M. Barkat-Defradas, & J. J. Ohala (Eds.), Voice Attractiveness: Studies on Sexy, Likable, and Charismatic Speakers (pp. 299–316). Springer Singapore. https://doi.org/10.1007/978-981-15-6627-1_16

Völkel, S. T., Buschek, D., Eiband, M., Cowan, B. R., & Hussmann, H. (2021). Eliciting and Analysing Users’ Envisioned Dialogues with Perfect Voice Assistants. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. https://doi.org/10.1145/3411764.3445536
