The Power of Speech: Connecting Audiences in Dialogue with Voice User Interfaces
Abstract
Voice-enabled interfaces are challenging the long dominance of graphical user interfaces in creating interactions that feel more personalized, authentic and approachable. This paper offers a case study on Voice User Interfaces (VUI) in connection with an interactive component of an exhibition, Time to Act. This case study reviews how our team developed a VUI: the content choices, the technical challenges, and some of the solutions we found. Time to Act: Rohingya Voices is an exhibition that opened in the summer of 2019, created by the Canadian Museum for Human Rights (CMHR). The exhibition compassionately portrays the plight of the Rohingya people, who, after decades of violent persecution, have suffered genocide at the hands of Myanmar military forces and an ongoing humanitarian crisis. People and communities worldwide are compelled to consider what action to take. One of the challenges for this exhibition was to immerse visitors in its content in a way that emphasized its relevance, encouraged empathy, and directly connected them to the subject matter. Leveraging VUI as the primary interface, along with supplementary visual, auditory, and tactile components, enabled voice interaction between museum visitors and Rohingya now living in Canada. Careful consideration of the CMHR's universal design approach also paved the way for innovation, allowing for verbal, non-verbal, tactile, visual, and auditory elements that supported a fully accessible experience for audiences of all abilities. Employing voice as a primary input method led to strong performance and evaluation of visitor engagement, pointing the way towards a more integrated and holistic user experience.
Keywords: 3D printing, artificial intelligence, inclusive design, dialogue, oral history, emerging technology
The Canadian Museum for Human Rights (CMHR) strives to create space for meaningful connections with its subject matter and, perhaps more importantly, create space for transformative conversations. With human rights subject matter, conversations can often help people gain insight into complex problems where answers may not belong to a single perspective, or may not exist at all.
The museum’s mandate is to explore the subject of human rights, with special but not exclusive reference to Canada, in order to enhance the public’s understanding of human rights, to promote respect for others, and to encourage reflection and dialogue. As the world’s first museum exclusively dedicated to human rights, it is centred around the idea that respect for others and understanding the importance of human rights can serve as a positive force for change in the world. Dialogue is one of the foundational concepts through which the Museum promotes and preserves Canadian heritage, inspires research and engages audiences.
Given the Museum’s focus on dialogue and oral history, Voice User Interfaces (VUI) offer an incredible opportunity for experience design and exploration. As one of the most digitally advanced museums in North America, and one of the most inclusive museums in the world, the CMHR continues to explore opportunities to use technology to connect visitors with content. The focus of its digital experience design is to create pedestrian encounters with technology that are transmedia-based and multifaceted, ensuring that audiences of all abilities are able to connect with its content.
A custom voice-first interface was developed as an empathic approach for visitors to experience as part of Time to Act: Rohingya Voices, a temporary exhibition that opened in June 2019 at the CMHR. From an experience design perspective, the voice-driven exhibit component sought to explore the question: “What happens when a visitor can ask an exhibit a question, using their own voice?” This paper will examine a spectrum of considerations when designing and implementing VUIs.
During the latter part of the twentieth century, the popularity of personal computing was greatly influenced by the use of Graphical User Interfaces (GUI) to aid human-computer interaction. Over the last decade, devices like smartphones, tablets, and televisions have been enhanced with the addition of voice-control systems (Klein, 2015).
VUIs can be used as primary or supplementary interfaces that enable interaction between people and technology. VUIs are often paired with visual, auditory, and tactile elements to reinforce or complement user input and output, as well as increase empathic response (Kraus, 2017). This is not to ignore the value of screens for efficient display of large amounts of information. However, combining tactile, visual and voice-driven interfaces allows for greater capacity to address creative and accessible solutions to digital experience design.
Today, one in four adults in the USA has access to a smart speaker, according to new research from Voicebot.ai (https://voicebot.ai/google-home-google-assistant-stats/). This works out to an adoption rate of around 26 percent, or 66.4 million American adults. VUI has emerged as potentially the most ubiquitous user-facing solution since mobile computing and responsive design (Whitenton, 2017).
Designing experiences for voice anticipates two separate functions: voice inputs and audio outputs. In considering design models for VUI, it is important to relate voice-first interaction to the idea of multimodal interaction. This encompasses both input and output, along with points of feedback, in creating interactions that are more tangible by using different sensory channels. By focusing on complementary multimodal information (haptic, audio, tactile, visual), voice can increase the total amount of information received and has a beneficial effect on interaction (Milekic, 2002).
At a basic level, the human voice is one of the easiest and most intuitive ways to communicate. While it is not a complete replacement for visualization, it allows users to quickly give commands to an interface, using their own terms. Voice allows users to multitask through use of natural language, and can bypass the need for complex navigation (Bania, 2018). Natural language processing also offers opportunities for greater accessibility for users who experience low vision or blindness. However, it is important to note that VUI presents a barrier for Deaf audiences, so close attention must be paid to tangible and GUI considerations that provide additional feedback.
Time to Act: Rohingya Voices was developed by the Museum in collaboration with photojournalist Kevin Frayer and a group of Rohingya and Burmese community members who live in Canada. The exhibition compassionately portrays the plight of the Rohingya people, who, after decades of violent persecution, have suffered genocide at the hands of Myanmar military forces and an ongoing humanitarian crisis. The exhibition also captures their call for the world to recognize and resist Myanmar’s efforts to dehumanize and eradicate them. People and communities worldwide are compelled to consider what action to take.
The primary experience in gallery is a series of large-scale projections of Kevin Frayer’s powerful black-and-white images depicting the exodus of the Rohingya people. The display of these images is further augmented by two large-scale tactile photographs with audio descriptions, as well as quotes from members of the community who participated in the development of the exhibition.
The secondary element consists of two parts. The first is a community wall that displays candid photographs and videos taken by the community of the refugee camps and conditions of the refugee crisis in Bangladesh, as well as community life in Canada. The second is an Oral History Interactive (OHI) that features VUI as its primary input.
The experience design intent behind the OHI was to present the voices of the Rohingya community in a dialogic way as a juxtaposition to the journalistic and community photographs presented in the space. The interactive draws a link between the visitor’s own voice, agency and concepts of oral history through active participation that can also be shared with other visitors who may be passively watching the activity. Leveraging VUI as the primary interface, along with complementary visual, auditory and tactile components enables voice interaction between museum visitors and Rohingya people now living in Canada.
This technology approach was informed by aspects of many similar projects that leverage natural voice interfaces, such as chatbots, Alexa, Siri, Google Assistant, and luminary projects such as New Dimensions in Testimony created by the Shoah Foundation (https://sfi.usc.edu/dit).
Leveraging an iterative and inclusive design approach, the Museum explored the core experience and began investigating how devices listen to us. To ensure a diverse and inclusive perspective, the design team held four rounds of user testing with staff members from cross-disciplinary teams, including front-facing and back-of-house staff who had deep knowledge in facilitating dialogue, as well as volunteers and students. Each round of user testing built complexity upon the previous solution to prove or disprove features and considerations in the refinement of the experience.
Starting with a minimum viable product, our initial round of prototyping, used as a test of the basic mode of interaction, consisted of a blended model of generic questions and photos of the oral history participants, using a keyboard as input and a monitor for feedback. The design team invited staff to interact with the model and received a large amount of support and interest in the experience. The team also learned that a lower threshold for interaction was needed to make it easier for users to test the system. The initial prototype was a success and the project received full approval to continue.
Working with the community to express the goals of the exhibit component, we selected 12 recorded oral histories to form the content of the interactive. Rather than having an individual conversation with one person representing the entire community, it was determined that it would be much more representative if, when a user asked a question, any one of the 12 individual Rohingya people would respond. This also meant that users could ask the same question multiple times and experience different answers from multiple perspectives.
While developing the mechanics of experience design, the research and curation team reviewed and selected clips, then categorized them based on oral history interview questions as well as considering how they aligned to the overall exhibition goals. Based on the requirement that the experience should mimic a conversation, clips were also considered based on length. After several rounds of review and testing, the following questions were developed to form the basis of the VUI experience:
- Where were you born?
- Can you tell me about your family?
- What does it mean to be Rohingya?
- What was your experience coming to Canada?
- How is your life in Canada?
- What do you want people to know about the Rohingya genocide?
- What can people do to take action?
Understanding the questions and the intent behind them allowed the design team to identify which keywords and patterns to look for when natural language detection was taking place. The design intent was to recognize as many variations of the questions as possible, and to have these questions appear as suggestions (e.g., “Here are some questions you can ask:”) to help triage errors when the system did not understand what the user was asking verbally.
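The idea of mapping many spoken variations onto a small set of supported questions can be illustrated with a minimal keyword-overlap sketch. This is not the Dialogflow configuration used in the installation; the keyword sets and the function name `match_question` are hypothetical, chosen only to show the principle of scoring an utterance against each question and declining to answer when nothing matches or the match is ambiguous.

```python
import re
from typing import Optional

# The seven supported questions, as listed above.
QUESTIONS = [
    "Where were you born?",
    "Can you tell me about your family?",
    "What does it mean to be Rohingya?",
    "What was your experience coming to Canada?",
    "How is your life in Canada?",
    "What do you want people to know about the Rohingya genocide?",
    "What can people do to take action?",
]

# Hypothetical keyword sets, one per question, in the same order as QUESTIONS.
KEYWORDS = [
    {"born", "birth", "birthplace"},
    {"family", "parents", "children", "siblings"},
    {"mean", "means", "identity"},
    {"coming", "came", "arrive", "arrived", "journey"},
    {"life", "living"},
    {"genocide", "know"},
    {"action", "act", "help"},
]


def match_question(utterance: str) -> Optional[str]:
    """Return the supported question that best matches the utterance, or None."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = [len(words & keywords) for keywords in KEYWORDS]
    best = max(scores)
    # No keyword hit, or a tie between questions: treat as not understood.
    if best == 0 or scores.count(best) > 1:
        return None
    return QUESTIONS[scores.index(best)]
```

In this sketch, `match_question("Tell me about your parents")` resolves to the family question, while an unrelated utterance returns `None`, which is the point at which suggested questions would be offered.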
Initially, the team explored integrating an Alexa-based device, but quickly moved to a custom system as the Alexa device did not provide the customizability needed. After trying a few API-based voice recognition systems, along with Alexa, our team began exploring Google’s Dialogflow (https://dialogflow.com). The chosen system utilizes Google’s Dialogflow for performing automatic speech recognition (ASR) and natural language processing (NLP). The application GUI was created with Electron. The physical interface was built using Arduino and emulates a USB keyboard. Note: Since starting this project, several local natural language processors have become available which would have allowed eliminating cloud-based processing such as Google Dialogflow. The advantage of this would have been the removal of the current requirement to have a live Internet connection.
An early opportunity to simplify the interaction was removing any expected cues for a continuous listening agent. Due to the emergence of privacy concerns when it comes to VUI, many members of the testing group expressed concerns about a pervasive “listener” in the gallery. Phrases such as “Hey Siri, Alexa, or Google” were not going to be successful modes of initiating conversation with this interactive. It was quickly determined that users preferred a microphone activation or listener “button” that had to be manually triggered.
Developing that button led to a more refined schematic in expected user behaviour and options, and also enhanced inclusivity for users of all abilities.
Due to the CMHR’s delivery of all its content in both official languages (English and French), early models attempted to detect the language being spoken. However, due to the latency and pause of the system while processing this information, as well as testing with users who had heavily accented speech, it was determined that language detection through speech heavily degraded the user experience. Automated language detection was therefore removed as a feature in favour of a tactile solution, which was consistent with other blended physical/digital modalities featured in various galleries in the museum.
This resulted in the next prototype requiring language selection, microphone activation, and a method to non-verbally select from seven separate questions. The number of physical buttons seemed to keep multiplying, and a solution was needed that could greatly simplify the design and overall experience.
To provide a simple and intuitive interface, the design team wanted to give visitors a single physical element with which to interact (besides language selection). This solution evolved through a combination of several design concepts: the modality and tactile nature of a click-wheel, iOS picker menus (like those found in timer- or date-selection menus), and the volume control button commonly found in most car stereos.
A large, brightly lit button was selected to activate the microphone. As part of the base of the microphone button, the team designed and 3D printed a dial and gear-set that mounted around the bottom of the button. This provided a simple and intuitive way to scroll through the available questions. The dial is ribbed in a fashion that suggests its rotation. The button is further detailed with a backlit microphone icon, which aids in affirming visually that this exhibit component expects to hear audio input.
Wrapping the central activation button with a 3D printed rotating dial enabled visitors to scroll through the supported questions directly, providing an alternate, tactile, non-verbal mode of interaction. When a visitor rotates the dial, the options scroll by on the screen with an audible click as it moves. When they pause rotating, the selected question is read aloud and they can press the button to activate it.
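The dial-driven picker can be modelled as a small piece of state: an index into the list of supported questions that moves with each encoder tick and is confirmed by a button press. The sketch below is an illustration only; the class name `QuestionPicker` is hypothetical, and the wrap-around behaviour at either end of the list is an assumption, since the paper does not say whether the on-screen picker wraps.

```python
from typing import Sequence


class QuestionPicker:
    """Tracks which question is highlighted as the dial is rotated."""

    def __init__(self, questions: Sequence[str]) -> None:
        if not questions:
            raise ValueError("at least one question is required")
        self.questions = list(questions)
        self.index = 0

    def rotate(self, ticks: int) -> str:
        """Move by a signed number of encoder ticks (assumed to wrap around)."""
        self.index = (self.index + ticks) % len(self.questions)
        return self.questions[self.index]

    def press(self) -> str:
        """Confirm the currently highlighted question."""
        return self.questions[self.index]


picker = QuestionPicker(["Where were you born?",
                         "Can you tell me about your family?",
                         "How is your life in Canada?"])
picker.rotate(1)           # clockwise by one tick
picker.rotate(-2)          # counter-clockwise by two ticks, wrapping to the end
selected = picker.press()  # the question that would be read aloud and activated
```

Because selection here depends only on a locally stored list, nothing in this path requires speech recognition.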
3D printing proved an invaluable resource for creating and prototyping this one-off, single-button interface. Since the button already had a central shaft for mounting and we needed the dial to be able to rotate around the same central point, we chose to print a planetary-gear-style mechanism, which allowed us to use an off-the-shelf rotary encoder mounted off-center to capture the dial’s rotation. A suitable base design was found on Thingiverse and modified to suit our requirements (https://www.thingiverse.com/thing:138222).
In addition to the physical aspects of the tactile components, close attention was paid to the on-screen and audio prompts that provide visual and audio cues. A 4K 60″ screen was physically masked to create a square 1:1 ratio. The top portion of the display would provide “attract” cues as well as display 16:9 aspect ratio oral history videos in response to questions. The size of the screen allowed these video clips to appear at a human-scale.
The bottom portion of the display would provide live transcription, as well as display the “picker” non-verbal question selection when the rotation dial is activated. Live transcription was a key component in managing expectations for sighted users, as the person speaking would immediately see speech-to-text rendering that would indicate whether the system was understanding their speech. Conversely, all on-screen prompts are played back through recorded audio, so visitors that experience low-vision or blindness are also able to enjoy a complete experience.
Video playback is instantaneous on successful recognition of speech. Due to the number of subjects (12) and number of video responses, all of the questions have numerous combinations and it is quite difficult to replicate an experience when asking all 7 questions. In addition, the system has been programmed so that if a user asks the same question again and again, they will hear all 12 respondents without repetition. All videos have been closed-captioned and feature simultaneous sign language in ASL (English) and LSQ (French equivalent).
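One common way to obtain this "no repeats until everyone has been heard" behaviour is a shuffle bag: the respondents for a question are shuffled, drawn one at a time, and reshuffled only when the bag is empty. The paper does not describe how the installation implements it, so the following is a sketch of the general technique, with a hypothetical `ShuffleBag` class and respondents represented simply as the numbers 1 to 12.

```python
import random
from typing import Iterable, List, Optional


class ShuffleBag:
    """Draws items in random order, without repeats within one pass."""

    def __init__(self, items: Iterable[int],
                 rng: Optional[random.Random] = None) -> None:
        self._items: List[int] = list(items)
        if not self._items:
            raise ValueError("the bag needs at least one item")
        self._rng = rng or random.Random()
        self._remaining: List[int] = []

    def draw(self) -> int:
        # Refill and reshuffle only once every item has been used.
        if not self._remaining:
            self._remaining = self._items[:]
            self._rng.shuffle(self._remaining)
        return self._remaining.pop()


# One bag per question; respondents are numbered 1..12 for illustration.
bag = ShuffleBag(range(1, 13), random.Random(0))
first_pass = [bag.draw() for _ in range(12)]
second_pass = [bag.draw() for _ in range(12)]
```

Each pass of 12 draws contains every respondent exactly once. Note that in this simple version the last respondent of one pass can reappear as the first of the next; avoiding that would need an extra check at refill time.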
Because answers are based on natural language and certain keywords, it is also important to manage visitors’ expectations surrounding the length of an answer. A thin progress bar extends along the edge of the footage to help the user gauge the length of the response. Responses can vary from a few seconds to almost 2 minutes, so it became important through testing to examine how best to communicate this in a way that did not disrupt the conversational nature of the interaction, while still adding benefit to the user.
VUI, due to its nature as a relatively new technology in the museum context (Devine, 2018), invites experimentation and provides an entry point for visitors to connect with subject matter in a unique way. Museum visitors will sometimes simply say “hello,” to which the interactive will respond. Sometimes visitors will ask questions about the exhibition that are not covered by the recorded content, or ask what the temperature is outside, and so on. In these cases the system politely tells the visitor either (a) that it cannot answer that, or (b) that it did not understand the question; and (c) if it does not understand a second time, it offers suggestions on what to ask.
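The escalation described above (a polite refusal or a request to repeat on the first miss, suggestions on the second) can be sketched as a small counter-based policy. The class name `FallbackPolicy` and the wording of the first two messages are hypothetical; only the suggestion prompt "Here are some questions you can ask:" comes from the paper. Whether the installation keeps offering suggestions on every later miss is also an assumption.

```python
from typing import Sequence


class FallbackPolicy:
    """Chooses a spoken/on-screen response when a question cannot be answered."""

    def __init__(self, suggestions: Sequence[str]) -> None:
        self.suggestions = list(suggestions)
        self.misses = 0  # consecutive failed attempts

    def on_success(self) -> None:
        """Call when a question was matched and a video response played."""
        self.misses = 0

    def on_miss(self, out_of_scope: bool) -> str:
        """Call when the utterance was out of scope or not understood."""
        self.misses += 1
        if self.misses >= 2:
            return ("Here are some questions you can ask: "
                    + " ".join(self.suggestions))
        if out_of_scope:
            return "Sorry, I can't answer that."          # hypothetical wording
        return "Sorry, I didn't understand the question."  # hypothetical wording


policy = FallbackPolicy(["Where were you born?", "How is your life in Canada?"])
first = policy.on_miss(out_of_scope=False)
second = policy.on_miss(out_of_scope=False)
policy.on_success()
after_reset = policy.on_miss(out_of_scope=True)
```

Resetting the counter on success means a visitor who recovers after one error is not shown the suggestion list on an unrelated later miss.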
Accessibility and universal design often play a role in creating solutions that are holistic and unanticipated. As indicated earlier, a live Internet connection is required to use Dialogflow. If there are network issues, the system may not be able to break out of the error message cycle because it cannot resolve what a person is asking. The solution can be found in the tactile dial, which allows users to select questions non-verbally. Because the questions are pre-populated in the system and natural language recognition does not have to occur, the non-verbal, tactile solution still works if the network is down.
Last but not least, graphic design played a critical role in bringing together the entire experience. In developing the paper prototypes and testing various designs in-gallery with the public, we arrived at a system consisting of two panels on either side of the screen, which contained minimal instruction but core information supporting the experience: a call to action, portraits of the 12 participants that you would meet when entering the experience, where they are from in Canada, and the suggested questions you could ask. The microphone, activation button, language selection, and on-screen messaging are fully accessible. All on-screen typography, transitions, contrast, and colours were selected using WCAG 2.1 AA standards.
Figure 16: Final graphic design of the printed portions of the interactive.
Observation Study Results:
For evaluation, the exhibition was divided into roughly two areas. Zone 1 was made up of the passive projected image displays, two tactile photos with interactive audio descriptions, seating, a map, and introductory panels. Zone 2 encompassed the community wall, including a small number of artifacts, the Oral History Interactive, and the visitor action centre. Visitors were randomly selected for observation, excluding minors and members of organized tour groups. Visitors were tracked through the whole space, but the division between the two different content areas was useful for understanding where visitors were engaged.
Behaviours tracked were:
- Dwell time
- Number of stops, and stop time
- Engagement with interactives
- Level of interest
This study found that 76% of visitors stop at one or more of the interactives in this gallery. More specifically, the OHI has a 60% stop rate. Visitors who stop at interactives are divided into categories based on their level of engagement with the exhibition element:
- Level One: A visitor who stops and looks only
- Level Two: A visitor who stops and watches another visitor engage but does not use the interactive themselves, or who uses the interactive only once
- Level Three: A visitor who explores the interactive, manipulating it more than once
The Oral History Interactive has the highest percent of those who engage at the second level of any element in the exhibition, mostly watching a companion or another visitor use the interactive.
This experience was more successful at getting visitors to stop than any interactive in the museum’s core galleries.
Other highlights from the observational study include:
- Most visitors used the tactile dial to ask questions rather than verbal interaction
- Groups with young children were more likely to ask questions verbally
- Most users who tried to ask a question verbally got at least one error message
- Most of the “stop only” visitors for the OHI are reading the panel, rather than looking at the screen or tactile components
- Observations in other galleries indicate that digital interactives with physical components to manipulate are more successful in attracting interaction than screens alone; the OHI continues this trend.
For our final design, our goal was to give visitors of all abilities an equitable experience, in which any accommodation made for one group would also be useful to others. Starting from the attract mode, many exhibits draw you in with a visual animation and written title, and possibly an audio soundscape. For interactives, instructions are usually provided as text or icons on a screen or in print.
Through testing the attract mode of the OHI, another barrier we discovered in entering the experience was prompting visitors to speak out loud to the interface. The exhibition is on the 6th floor, and for visitors to encounter a VUI at this stage of their journey presented an interesting problem to solve. To improve the attract experience we added a presence detection feature that welcomes detected visitors bilingually and instructs them to select a language. This is presented on screen and audibly. Using a small web camera above the exhibition, a lightweight facial recognition solution triggers the welcome and instructions when visitors are looking directly at the installation.
Once a visitor presses a language button, which is positioned similarly to those on other nearby interactives for consistency, the voice and screen prompt them to ask a question, with brief instructions on doing so. Once a question is asked or selected, the system finds the most similar supported question and picks an answer for that question from the list of recorded answers. Once a response completes, the user is asked if they would like to ask another question and the instructions for the controls are repeated.
As with many interactive installations, the modes of interaction are often at least somewhat novel to the visitor. Knowing that not all visitors are able to read the instructions on screen, it was important that anything found on screen that was important to the visitor’s experience also be presented audibly. Visitors who are looking for instructions can glance at them as needed, but for those without that option we felt it was important to present just enough instruction, just in time.
The VUIs most visitors are familiar with are the voice assistants found increasingly in their pockets, on their wrists, and in their homes. These are usually triggered by a keyword or phrase to wake the assistant. The assistant then tends to continue listening between responses to the user as long as it feels the conversation is still active.
Since we are in a public space which can at times be quite noisy, we opted for a slightly different solution. Knowing that not everyone is comfortable with always-listening devices, we chose to use a button that is held down to opt in to listening. A loud space means that background voices can sometimes be picked up by the system, causing it to think the visitor is still speaking. Having it listen only while the button is held helps the system to know when a user has completed their input.
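The hold-to-talk behaviour amounts to gating the audio stream on the button state: audio is buffered only between a button-down and the matching button-up, and the release marks the end of the utterance. The sketch below illustrates that logic over a simple list of events; the event names (`"down"`, `"up"`, `"audio"`) and the function `utterances` are hypothetical and stand in for whatever the physical button and microphone actually report.

```python
from typing import Iterable, List, Tuple

# An event is (kind, payload); payload is only meaningful for "audio" events.
Event = Tuple[str, bytes]


def utterances(events: Iterable[Event]) -> List[bytes]:
    """Collect audio captured while the button is held, one item per hold."""
    held = False
    buffer = bytearray()
    completed: List[bytes] = []
    for kind, payload in events:
        if kind == "down":
            held = True
            buffer = bytearray()  # start a fresh utterance
        elif kind == "up":
            if held:
                completed.append(bytes(buffer))  # release ends the input
            held = False
        elif kind == "audio" and held:
            buffer.extend(payload)  # audio outside a hold is ignored
    return completed


stream = [
    ("audio", b"background chatter "),  # ignored: button not held
    ("down", b""),
    ("audio", b"where were "),
    ("audio", b"you born"),
    ("up", b""),
    ("audio", b"more chatter"),         # ignored: button released
]
captured = utterances(stream)
```

With this gating, background voices before or after the hold never reach the recognizer, and the end of input is signalled explicitly rather than inferred from silence.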
The voice-first nature of this interactive provides a more natural way for visitors to interact with recorded oral histories through dialogue, as well as advantages for inclusive design. Perhaps more importantly, VUI lends itself to empathic experiences, which can create transformative storytelling opportunities for visitors (Peng, 2019). During oral history recordings, subjects often communicate their feelings and internal states through the voice, and as such, voice-only communication allows users of this system to focus their attention on the channel of communication most active and accurate in conveying emotions to others (Rohan, 2015).
It is important to recognize that not all visitors are comfortable or willing to voice a question in a gallery setting. Museum galleries are traditionally perceived as quiet spaces, similar to libraries. This can be further compounded by ability, self-consciousness, an accent that is poorly supported by voice detection software, or simply not knowing what to ask.
Now that this VUI experience has been installed and opened to the public, we continue to review user feedback to consider further augmentation. We are now investigating next steps to open-source the software side of our solution, as well as providing an equipment list to allow other museums to begin to explore VUI as a solution for digital experience and engagement within cultural heritage.
As we look to the future, we can see VUIs being integrated into other experiences in gallery. Advances in voice processing have made local interpretation of speech more feasible at a high quality of understanding. This would allow eliminating reliance on external systems, increasing the robustness of the system.
There have also been experiments using a webcam and computer vision to recognize sign languages in order to communicate with voice assistants, as a way to give complete agency to members of the Deaf community. This seems like a natural extension of our inclusive approach (https://github.com/shekit/alexa-sign-language-translator).
The VUI in Time to Act: Rohingya Voices was developed specifically for Rohingya oral histories, but it can conceivably be easily expanded into its own resource to give visitors access to our broader collection of interviews in a way that allows users to engage with intangible complex subject matter without complex menus or navigation.
Voice-enabled interfaces are propagating and redefining experience design across multiple channels. More than a design disruptor, this is creating infinite opportunities for cultural heritage to connect digital experience to natural forms of communication, creating interactions that feel more personalized, authentic and approachable. Time to Act: Rohingya Voices is just one example of applying a sliver of a voice-enabled experience to open our audiences to new ways of experiencing museum content. Voice can compassionately portray subject matter, and provide agency to previously unheard voices. Visitors and users are asked to consider what action to take in an empathic methodology that is fully accessible. While preliminary research indicates that VUI is successful in attracting and retaining visitor attention, further research should be conducted on the emotional and cognitive effects of interacting with subject matter using this method.
Careful consideration of the CMHR’s universal design approach also paved the way for innovation, allowing for verbal, non-verbal, tactile, visual, and auditory elements that support a fully accessible experience for audiences of all abilities. Employing voice as a primary input method has led to strong performance and evaluation of visitor engagement, pointing the way towards a more integrated and holistic user experience.
A special thank you to Samantha Kilpatrick, Data Analytics Specialist, for her insight, edits, and significant contribution of the Observation Study Results.
Devine, Catherine. (2018). “The Digital Footprint.” Digital Trends. Published April, 2018. Consulted December 2019. Available https://mw18.mwconf.org/paper/the-digital-footprint/
Peng, Jingyu. (2019). “How Does This Exhibition Make You Feel? Measuring Sensory and Emotional Experience of In-Gallery Digital Technology with GSR Devices.” Museums and the Web, LLC. Published April, 2019. Consulted December 2019. Available https://mw19.mwconf.org/paper/how-does-this-exhibition-make-you-feel-measuring-sensory-and-emotional-experience-of-in-gallery-digital-technology-with-gsr-devices/
Bania, Ashok. (2018). “Designing and Building for Voice Assistants (Alexa and Google Assistant): Guide for Product Managers.” Medium. Published Dec 20, 2018. Consulted December 2019. Available https://chatbotslife.com/designing-and-building-for-voice-assistants-alexa-and-google-assistant-guide-for-product-d2a171aa80d5
Whitenton, Kathryn. (2017). “Voice First: The Future of Interaction?” Nielsen Norman Group. Published Nov 12, 2017. Consulted December 2019. Available https://www.nngroup.com/articles/voice-first/
Rohan, Sarah. (2015). “From Witness to Storyteller: Mapping the Transformations of Oral Holocaust Testimony Through Time.” University of Michigan LSA. Published Winter, 2015. Consulted December 2019. Available https://lsa.umich.edu/content/dam/english-assets/migrated/honors_files/ROHAN%20S.%20From%20Witness%20to%20Storyteller.pdf
Kraus, Michael W. (2017). “Voice-Only Communication Enhances Empathic Accuracy.” American Psychological Association, Yale University. Published February 12, 2017. Consulted December 2019. Available https://www.apa.org/pubs/journals/releases/amp-amp0000147.pdf
Milekic, S. (2002). “Towards Tangible Virtualities: Tangialities.” Museums and the Web LLC. Published April, 2002. Consulted December 2019. Available http://www.archimuse.com/mw2002/papers/milekic/milekic.html
Klein, Laura. (2015). “Design for voice interfaces: Building products that talk.” O’Reilly. Published Nov 5, 2015. Consulted December 2019. Available https://www.oreilly.com/content/design-for-voice-interfaces/
Gillam, Scott and Bergman, Benjamin. "The Power of Speech: Connecting Audiences in Dialogue with Voice User Interfaces." MW20: MW 2020. Published January 16, 2020. Consulted .