Voice interfaces for video

Voice is the future of video, or at least something to which industry watchers should pay attention. That was one of the takeaways from the DTG FutureTech seminar in London. While some might still feel uncomfortable talking to their television, it may come more naturally to younger viewers.

Ronan de Renesse of Ovum, soon to be known as Omdia, spoke of the increasing popularity of smart speakers and voice assistants to control television. Apparently, 11% of those who have such devices at home control their television entirely through voice.

That seems surprisingly high, but the numbers are self-reported, so may reflect the perception of some users rather than reality. After all, changing the volume or muting the audio is not necessarily best accomplished by talking to the television.

However, voice is well suited to searching for something to watch. Media companies need to develop voice experiences, but these need to work across multiple ecosystems, including smart speakers and operating systems from Amazon, Google and others.

The future of smart speakers may also be smart displays, as they provide secondary displays that can offer user feedback.

Patrick Bryden from TiVo talked about conversational media, moving beyond simple command and control to a more interactive dialogue.

TiVo is behind the voice control available with Sky Q in the United Kingdom. This can work remarkably well for simple search and navigation, when you already know what you are looking for.

It is also capable of a more conversational mode, maintaining context to understand a series of interactions that can progressively refine a selection.

This requires a much more sophisticated approach to natural language understanding, as well as a knowledge graph that captures complex relationships and current contexts to allow inferences to be made.

That potentially addresses typical use cases when users have some idea about the sort of thing that they want to watch but do not have a particular programme in mind.
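To make the idea of maintaining context concrete, here is a minimal sketch in Python of progressive refinement across turns. It is not a description of how TiVo or Sky Q works. The catalogue, the titles and the `Conversation` class are all invented for illustration, and the sketch assumes that each spoken request has already been turned into simple constraints by a language understanding step that is not shown.

```python
from dataclasses import dataclass, field

# Hypothetical catalogue: titles and metadata are invented for illustration.
CATALOGUE = [
    {"title": "Example Heist", "genre": "thriller", "decade": 1990, "kind": "film"},
    {"title": "Example Harbour", "genre": "thriller", "decade": 2010, "kind": "series"},
    {"title": "Example Laughs", "genre": "comedy", "decade": 2010, "kind": "series"},
    {"title": "Example Chase", "genre": "thriller", "decade": 2010, "kind": "film"},
]


@dataclass
class Conversation:
    """Keeps constraints from earlier turns so later turns can refine them."""

    constraints: dict = field(default_factory=dict)

    def refine(self, **new_constraints):
        # A later turn adds to, or overrides, what was said before.
        self.constraints.update(new_constraints)
        return self.results()

    def results(self):
        # Return the titles that satisfy every constraint gathered so far.
        return [
            item["title"]
            for item in CATALOGUE
            if all(item.get(key) == value for key, value in self.constraints.items())
        ]


session = Conversation()
first = session.refine(genre="thriller")   # "Show me some thrillers"
second = session.refine(decade=2010)       # "Just the ones from the 2010s"
third = session.refine(kind="film")        # "Only films"
print(first, second, third, sep="\n")
```

Each turn narrows the previous result rather than starting a new search, which is the difference between a conversational mode and simple command and control. A real system would also need to decide when a request starts a new conversation, which this sketch leaves out.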

It is well worth paying attention to this, given the frustration that many users face with rows of choices of titles, each represented with a single cover art image.

People who talk to their television apparently watch more and are much less likely to churn. Subscriber churn among regular voice users is reportedly around 1%, significantly lower than industry norms of around 10% per year.

Another interesting development is the use of voice to recognise younger users, together with other cues like time of day and programme genre or rating. It is possible to determine by pitch whether a young child is speaking. If they are asking for a suitable programme, they can automatically be directed to a kid zone, providing a safer viewing environment, until an adult voice requests something else to watch.
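The routing logic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method any vendor uses. The pitch threshold, the set of suitable ratings and the `ProfileRouter` class are hypothetical, the pitch estimate is assumed to come from an upstream audio step that is not shown, and other cues such as time of day are left out for brevity. What happens when a child's voice asks for unsuitable content outside the kid zone is not stated in the talk, so the sketch simply leaves the viewer where they are.

```python
from dataclasses import dataclass

# Assumed values for illustration only; a real system would tune these on data.
CHILD_PITCH_HZ = 250.0            # hypothetical threshold on estimated pitch
CHILD_SAFE_RATINGS = {"U", "PG"}  # hypothetical set of ratings treated as suitable


@dataclass
class VoiceRequest:
    pitch_hz: float  # estimated pitch of the speaker, produced upstream
    rating: str      # rating of the requested programme


class ProfileRouter:
    """Tracks whether the session is in the kid zone across requests."""

    def __init__(self):
        self.kid_zone = False

    def route(self, request):
        if request.pitch_hz < CHILD_PITCH_HZ:
            # An adult-sounding voice leaves the kid zone.
            self.kid_zone = False
            return "main"
        if request.rating in CHILD_SAFE_RATINGS:
            # A child-sounding voice asking for suitable content enters it.
            self.kid_zone = True
        # Otherwise stay wherever the session already is.
        return "kids" if self.kid_zone else "main"


router = ProfileRouter()
print(router.route(VoiceRequest(pitch_hz=300.0, rating="U")))   # child, suitable
print(router.route(VoiceRequest(pitch_hz=300.0, rating="15")))  # child, still in kid zone
print(router.route(VoiceRequest(pitch_hz=120.0, rating="15")))  # adult voice
```

Pitch alone is a crude signal, since some adult voices are high and some older children's voices are low, which is presumably why it would be combined with other cues in practice.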

Voice prints could potentially remove the need to log in with a particular profile. That requires a level of trust. It is not clear that everybody is entirely comfortable with the idea of technology listening to them, or with talking to their television.

It seems that there is a need for more anthropological research about how people engage with such technologies, beyond early adopters who delight in them and may be more forgiving of their limitations.

There is some evidence that young people find it more natural to speak to technology. Recent research on media literacy from the communications regulator Ofcom suggests that use of smart speakers among those aged 5-15 increased from 15% in 2018 to 27% in 2019. It was 20% among those aged 5-7, 25% for those aged 8-11, and 36% for those aged 12-15. 37% of homes with children of these ages had a smart speaker, rising to 43% for those with children aged 12-15.