Voice assistants – such as Siri and Alexa – are becoming viable because the accuracy of speech recognition has increased significantly through machine learning.
I’m concerned about the political ramifications of the technical underpinnings of voice interfaces. Voice recognition has advanced through machine learning and access to vast amounts of data. This means that for people to use this interaction medium, they need to be using tools provided by big providers like Apple, Amazon and Google. How easily will people be able to create their own voice interfaces without relying on a corporate provider? How easily will you be able to identify the thing that you’re interacting with, the logic it’s driven by, and who ultimately owns and controls the means of interaction? What is the product, what generates value, and for whom?
Voice as an interaction medium has limitations. It requires (quiet) private space, and for you to be able to speak in a language the system understands. And so far voice interactions have focused on information or commerce transactions.
It’s worth reflecting on what problems we’re trying to solve when we pioneer a new interaction paradigm. What value are we trying to realise? Are transactions the big challenge we need to solve? (Would interaction approaches based around critically managing and engaging with information flows, or social connectedness, be different?)
The dominant method of interacting with computers so far has been mechanical, with feedback and state communicated visually by the computer. Thinking more broadly, the channels we use to give input to computers could be mechanical or oral, and each of our senses could be a channel for the computer to feed information back. And we could design for more than one type of input and feedback at the same time. This Smashing Magazine piece on multi-modal interfaces outlines interfaces that combine visual and audio elements:
“When we’re designing an interface, if we know the context, we can remove friction. Will the product be used in the kitchen when the user’s hands are full? Use voice control; it’s easier than a touchscreen. Will they use it on a crowded train? Then touching a screen would feel far less awkward than talking to a voice assistant. Will they need a simple answer to a simple question? Use a conversational interface. Will they have to see images or understand complex data? Put it on a screen”.
So rather than just focusing on voice input, it might make sense to think about how to structure our data to be agnostically accessible. Agnostic interaction channels don’t privilege one way of perceiving or interacting with the world. Now that more of the population is computer-literate, we might be able to let go of some skeuomorphic baggage. Skeuomorphic design’s visual language helped communicate, but it also constrained the behaviours and interactions we could design, because they had to be intelligible as physical metaphors. Old metaphors or expectations, like “saving” files by clicking on a floppy disk icon, could be seen as blockers to thinking more broadly about interaction.
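To make that concrete, here’s a minimal sketch (in TypeScript, with entirely hypothetical names) of what channel-agnostic data might look like: the content is described semantically, and each interaction channel supplies its own renderer, so neither voice nor screen is privileged.

```typescript
// A minimal sketch of channel-agnostic content: the data describes *what* is
// being communicated, and separate renderers decide *how* to present it.
// All names here (WeatherReport, renderForVoice, renderForScreen) are
// hypothetical, for illustration only.

interface WeatherReport {
  location: string;
  temperatureCelsius: number;
  summary: string; // e.g. "light rain"
}

// Voice channel: a single spoken sentence.
function renderForVoice(report: WeatherReport): string {
  return `In ${report.location} it is ${report.temperatureCelsius} degrees with ${report.summary}.`;
}

// Visual channel: structured fields a screen layout can arrange freely.
function renderForScreen(report: WeatherReport): { heading: string; detail: string } {
  return {
    heading: `${report.location}: ${report.temperatureCelsius}°C`,
    detail: report.summary,
  };
}

const report: WeatherReport = { location: "Leeds", temperatureCelsius: 11, summary: "light rain" };
console.log(renderForVoice(report));
console.log(renderForScreen(report));
```

The specifics don’t matter; the point is that the data itself doesn’t assume a screen, so a voice assistant, a screen reader or a visual layout can all draw on the same source.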
Perhaps the next paradigm of interaction design is to transcend interaction – to read and serve our intent directly. This feels a way off, but here’s an early proof of concept.
If we remove friction and translation from our interactions, what are we left with? Our own unmediated desires. Friction in transmitting intent can be a good thing – a chance to reflect and exercise deliberate control, rather than being driven just by desires. The task of cultivating deliberate intention, and making conscious decisions, rather than acting on impulse, is a design problem and a spiritual one.