Are you Head of Digital at a forward-thinking business, looking to explore how a Voice User Interface (VUI) could help you offer your services, make your customers happy and increase brand awareness? You’ve come to the right place.

But before we get into what to be aware of when building your own Conversational, Zero or Voice UI, your skill, action or app, let’s take a step back and go over the trajectory of our newly found, yet to be beloved tech.

The four horsemen

In an attempt to make a sweeping oversimplification of how technology behaves over time, let’s just focus on the following four phases:

  • Utility
  • Information
  • Entertainment
  • Transaction

This is no iron rule, golden rule or even a rule of thumb. This is just something that I can conveniently hang some ideas off, and hopefully give you some inspiration to start driving your business forward.

Why voice?

You’re at work, focused on the task at hand. Slack pops up and you check what’s going on. Someone posted a hilarious GIF in #random. And someone else added you to a Google Slides deck. Let’s check it out. Ah, it’s just the start of a project. You also get an email notification that you were added to that Slides deck. You go into your email client and delete the notification, because, you know, inbox zero. Right, you’ve done a lot already, let’s make a cup of tea and then back to work. Which tab were you in again? This one? Nah, that’s the article you were intending to read. What’s it about again?

Sound familiar?

Interruptions add up. Estimates based on a UC Irvine study show that refocusing your efforts after just one interruption can take up to 23 minutes. That same study found that workers switched tasks roughly every three minutes on average. That’s a lot of lost time and energy.

One potentially life-changing feature of voice interfaces is that, while not entirely distraction-free, they are a lot less invasive than traditional close-proximity UIs.

The benefit of not having to switch tabs, apps or move away from your standing desk means less distraction, less time spent recuperating from said distraction and more time being productive. Maybe you even get to go home on time today!

This brings us to the first phase.

Utility

The Amazon Alexa Skills Store is full of handy utility-based tools. One for dimming the lights, another for playing a song, a third for calling a friend and even one for finding the right recipe for tonight’s dinner party.

There are hundreds of handy tools to help you in your day-to-day life without having to lift a finger. And there are many more to come. Think of submitting your meter reading just by reading out the number. Seems banal, but at the moment, that same simple action requires you to take a picture of your meter, close the camera app, try to locate the app of your electricity supplier (let’s call the company ‘Plug’ for now), open it up, find where you need to input the meter reading, switch to your photo app, remember the first few digits and switch back to Plug. Then switch back to Photos and remember the remaining few numbers, switch back to Plug and hit submit.

Instead, you could say: “Alexa, tell Plug my meter reading is 123456789”. That’s at least a few minutes quicker and a lot less disruptive. Interruptions add up.

Many more utility-based actions, skills and apps will come on the market and they will all help to make your life more efficient.

Informational

Just as with phone apps, and the web before them, we’ll see a lot of purely informational VUIs in the near future. Think of how Google has become one of the world’s most successful companies by primarily focusing on providing information. But also, think of how many websites there are whose sole business model is to provide information.

At Hi Mum! Said Dad, we’ve been cooking up a Skill for BBC Good Food, one of the world’s largest English-language recipe providers, and a company whose business model has been built around providing information, in the shape of delicious recipes.

Our recipe Skill’s UX is already pretty tasty and will change the way that you interact with the BBC Good Food recipes that you have come to love; you should give it a try once it’s launched. Now, imagine how the other services that you use on a daily basis might work when they are purely audio-based. What about Wikipedia, Skyscanner, Booking.com, Udemy, Citymapper, Airbnb? How could they all sound?

Entertainment

Remember Angry Birds? The global sensation used the then-new iPhone touch screen and its unique features to create an addictive and fun game played by at least one world leader. The point is that any new technology will find its way to a pure entertainment application, and when it does so successfully (unique feature + mainstream adoption) it can blow up spectacularly. A few more features, such as response-time measurement, spatial awareness and pitch recognition, are needed before voice becomes a proper casual gaming platform. But let’s say we’re there: how would Tetris translate to voice-only? Bust-a-bubble? Pong? What about guitar lessons powered by machine learning (‘Awesome riff! Now try playing this “?”?’), debating and presentation skills, foreign language learning… which one will be the VUI’s killer app?

Transactional

It took a good few years for e-commerce to truly take off, but when it did, it cemented the internet’s ability to infiltrate everything we do. I remember buying only small things on the web and larger amounts in shops, largely due to security worries. But within only a few years, that has entirely reversed. I still buy stuff in shops, but only small things like lunch. Anything beyond £20 and I would much rather buy online.

Voice’s transition into e-commerce (or maybe v-commerce?) will be much quicker now that the security hurdle has been largely overcome. Amazon is rolling out Amazon Pay for charities, and will, of course, be looking to utilise their enormous fulfilment enterprise to make purchases truly frictionless for their customer base. Let’s just hope the invocation isn’t ‘I want it’ because my 3-year-old will have emptied my bank account before noon.

All tech giants are already perfectly set up for all of the above. Amazon has the edge right now simply by being the first to market voice on a global scale (if you don’t count Siri). But pretty soon, Google will start using their unrivalled data analysis prowess (years of predictive search training will come in handy) to deliver a service that will most likely be seamless across all devices with your Google account and will hopefully give us voice developers the chance to do truly remarkable things.

At the moment though, Amazon’s VUI ecosystem is the most developed. Many are developing for it first but are making sure their work is easily transferable to different platforms in the future by using software like Dialogflow (owned by Google, but outputs to Alexa and chatbot services).

So, onto what you’re really after: some handy takeaways for designing your script.

Designing a Voice UI

The following tips and tricks apply mostly to the first two stages, Utility and Information, and to a lesser extent, Transaction. Give us enough claps and I might devote the next chapter to Entertainment ;-).

Hunt and gather, but mostly gather

As we’ve established, you are Head of Digital at a forward-thinking company and chances are you offer your customers something special. So you’re confident you know your users’ requirements, right? Wrong. You probably have a good idea, but it is worth taking a step back and finding out what exactly your users would want from your VUI. Treat it as an MVP: your company might offer a plethora of products and services, and it would be foolish to try to offer everything you do on the first launch. First of all, this would be a costly endeavour. Secondly, and more importantly, not all of your services might fit neatly into a voice-only application, nor might they even need to be provided via voice.

Treating your VUI as an MVP means you need to cut the fat and focus only on one special aspect of your offering. Not only will you be able to ship fast, but you’ll also be able to gather data on whether this particular feature resonates with your existing customer base. This is a simple and established product design principle, but it is worth keeping in mind. VUIs can quickly balloon and before you know it, your feature-creep alarm is ringing non-stop.

Ask yourself the questions “What do my customers want from a voice-only interface?”, “Which features of my product do not require a visual component?” and “What advantages does a VUI give above my existing channels?”, but also, more fundamentally, “How does it help my business?”

You might need to go back to your personas to find out which of them would benefit most from your fresh new VUI. Bear in mind that voice-controlled speakers are generally community-driven purchases. Apart from early adopters, they are used by families at home, and, surprising or not, a slightly older demographic than is usual for tech innovations.

Script

Once you’ve established what your VUI should offer, you can start scripting. This is the fun, yet difficult part.

Reading out a sentence takes longer than looking at a picture or scanning a page, but we want to limit the time it takes a user to get where they want to be. So we need to shorten our sentences and ask the fewest possible questions. Generally, a VUI works through a back-and-forth, information-gathering conversation, which takes place in steps unless the user provides all the info in one shot.
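
One way to picture this step-by-step gathering is as "slot filling": the VUI only asks for what it doesn't know yet. The snippet below is a minimal Python sketch, not tied to any platform SDK. The slot names and questions are made up for a flight-search example.

```python
# Hypothetical slots for a flight search, in the order we would ask for them.
REQUIRED_SLOTS = {
    "destination": "Where would you like to fly to?",
    "date": "When would you like to travel?",
    "passengers": "How many people are travelling?",
}


def next_prompt(filled):
    """Return the question for the first missing slot, or None when done."""
    for slot, question in REQUIRED_SLOTS.items():
        if not filled.get(slot):
            return question
    return None


# A user who already said "flights to Barcelona" skips the first question.
print(next_prompt({"destination": "Barcelona"}))
```

If the user gives everything in one shot, next_prompt returns None and the conversation can move straight to results.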

Before you start designing a functional script, lay out something called a story frame. You can do this for products that end up completely visual, like apps or websites, as the lowest-fidelity form of prototyping. For VUIs in particular, it is a useful way to group information blocks.

When designing for voice you need to keep in mind that while the story you are telling is in sequence, users should be able to skip steps, or interrupt it whenever they feel like it. Very few conversational UIs cater for this effectively, so if you want to make a truly great experience for your users, allow them to navigate through your service at their own pace.

Note: There is no scrolling/scanning or deep-linking (yet), so a user will have to follow the path you lay out for them. This can be used to your advantage, but also requires you to make the journey as easy as possible. Your users will have to tread that same path, or at least that same invocation+intent, every time they use your service, so make it quick and enjoyable.

Happy Route

When designing a Skill or Action, it’s important to help users get to the information they require as quickly as possible, but before you can do so, you need to know how someone might go about searching for it.

These requests are described in terms of intents, which are combined with entities. Intents are things that the user can do with the app, e.g. plan a holiday, find a recipe, buy a product, submit a meter reading. Let’s imagine that you’re an airline looking to create a Voice Interface for booking flights. You might start with something like ‘Find me flights to Barcelona,’ in which ‘Find me flights’ is the intent and ‘Barcelona’ is an entity.

This brings us to the first hurdle. There are many ways to ask for something and we need to map them all out. For example, “Give me flights to Barcelona”, “Do you have flights to Barcelona”, “I’d like to fly to Barcelona” or “I have a wedding to get to in Barcelona, can I get some flights?”. All in all, you can easily get to 30+ phrasings of the same intent. And that’s only for a single entity! A user might also ask for “flights to somewhere hot in Spain” or “somewhere I can fly to in the south of Spain”.
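
To make the mapping concrete, here is a rough Python sketch of how several phrasings can point at one intent while the destination is pulled out as an entity. It is only an illustration: the intent name, the {city} placeholder and the pattern matching are assumptions of this example. Real platforms such as Alexa and Dialogflow take example phrasings like these as input and generalise from them with trained language models instead of exact patterns.

```python
import re

# Hypothetical sample utterances for one intent; {city} marks the entity.
SAMPLE_UTTERANCES = {
    "FindFlights": [
        "find me flights to {city}",
        "give me flights to {city}",
        "do you have flights to {city}",
        "i'd like to fly to {city}",
    ],
}


def _to_regex(utterance):
    # Escape the literal text, then turn the {city} placeholder into a capture group.
    pattern = re.escape(utterance).replace(re.escape("{city}"), r"(?P<city>[a-z ]+)")
    return re.compile("^" + pattern + r"[?.!]?$")


def match_intent(text):
    """Return (intent, entities) for the first matching sample utterance."""
    cleaned = text.lower().strip()
    for intent, utterances in SAMPLE_UTTERANCES.items():
        for utterance in utterances:
            found = _to_regex(utterance).match(cleaned)
            if found:
                return intent, {"city": found.group("city").strip()}
    return None, {}


print(match_intent("Do you have flights to Barcelona?"))
```

Anything that matches none of the patterns comes back as (None, {}), which is exactly the case the next section deals with.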

Luckily, you’ve done your research in the previous step, so you know what your users are going to ask for.

Dead ends

The above assumes the Happy Route. This is the path of least resistance, the easiest, and shortest route to success. But of course conversation isn’t always that clear cut and you have limited resources to account for every possible interaction. So users will end up at dead ends if you’re not careful. It is hugely important to take these into account and allow the user to navigate away from their path without friction when they do end up at a dead end.

Imagine a user asks for a flight that takes under 4 hours. Let’s say your API doesn’t have the option to filter by duration and you read out a flight with a four-hour duration. To avoid frustration, you will need to let the user know you can’t filter by duration, BUT that you have found a different flight that they might still be interested in. So even if you don’t allow a certain option, with voice you will still need to account for the possibility that a user might ask for it and provide them with a suitable reply.

If your VUI doesn’t understand an answer, make that clear to the user. When the user says something that isn’t recognised, respond with “I didn’t quite understand that, could you please repeat?” If they request a feature that isn’t supported, respond with “Sorry, I’ll be able to do that in a future update,” so the user has transparency, and follow it up with a way back in, like “…would you like to start over?” or “…but I can help you with…”
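
The three cases above (supported, known but unsupported, not understood) can be sketched as a simple lookup. This is a hedged Python illustration for the airline example. The filter names and reply wording are assumptions, not part of any real API.

```python
# Hypothetical capability lists for the airline example.
SUPPORTED_FILTERS = {"destination", "date", "price"}
KNOWN_BUT_UNSUPPORTED = {"duration"}


def reply_for(requested_filter):
    """Pick a reply that always leaves the user a way back in."""
    if requested_filter in SUPPORTED_FILTERS:
        return f"Sure, filtering by {requested_filter}."
    if requested_filter in KNOWN_BUT_UNSUPPORTED:
        # Be transparent about the gap, then offer an alternative.
        return (
            f"Sorry, I can't filter by {requested_filter} yet, "
            "but I can help you search by destination, date or price."
        )
    # Nothing recognised at all: ask the user to repeat.
    return "I didn't quite understand that, could you please repeat?"


print(reply_for("duration"))
```

The point of listing the unsupported requests explicitly is that you can only apologise gracefully for things you anticipated.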

Minimise choice

As mentioned above, speaking takes longer than scanning a page. If a picture speaks a thousand words, we’re going to lag behind. No one has time to listen to a full paragraph (unless that paragraph is the whole point of the app of course). So when you list out answers, stick to two or three responses in a sentence. You can follow this up with “Would you like to hear more?”.

This poses a problem for lists. While you might scan a list of ten or twenty items easily on your laptop, with a VUI you need to decide which items in the list to read out to the user. Filter by most popular, most recent or promoted: it’s up to you and your business needs.

But when you do, make sure to avoid any misunderstanding. “Here are some cat foods, ‘brandname1’, ‘brandname2’. Would you like to buy one?” may seem like a nice sentence, but what is the user going to reply? “…Which of these would you like to buy?” avoids ambiguity and gets you to your goal quicker.
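
As a small illustration of both tips (keep the list short, end with an unambiguous question), here is a Python sketch. The limit of three and the exact wording are assumptions for this example, and the items are expected to arrive already sorted by whatever your business prefers.

```python
def read_out(category, items, limit=3):
    """Build a short spoken list that ends with an unambiguous question."""
    shown = items[:limit]
    if not shown:
        return f"Sorry, I couldn't find any {category}."
    if len(shown) == 1:
        return f"I found one option: {shown[0]}. Would you like to buy it?"
    names = ", ".join(shown[:-1]) + " and " + shown[-1]
    reply = f"Here are some {category}: {names}. Which of these would you like to buy?"
    if len(items) > limit:
        # Only offer more when there really is more to hear.
        reply += " Or would you like to hear more?"
    return reply


print(read_out("cat foods", ["brandname1", "brandname2"]))
```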

Voice-enabled devices are getting better and better and pretty soon they will support better memory whilst switching apps, so you should keep that in mind in design. (“Welcome back [name], you were at step 3 of your booking process, do you want to continue or start a new search?” or “Hi there [name], I hope you had a good flight. Would you like to repeat the booking or go somewhere else?”).
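
Designing for that kind of memory could look something like the Python sketch below. It is purely illustrative: the in-memory dictionary stands in for whatever persistent storage a platform might offer, and the function names are invented for this example.

```python
# In-memory stand-in for whatever persistent storage a real skill would use.
saved_steps = {}


def save_progress(user_id, step):
    """Remember which booking step the user reached."""
    saved_steps[user_id] = step


def welcome(user_id, name):
    """Greet the user, offering to resume if a booking was left unfinished."""
    step = saved_steps.get(user_id)
    if step is None:
        return f"Hi there {name}, where would you like to fly to?"
    return (
        f"Welcome back {name}, you were at step {step} of your booking process. "
        "Do you want to continue or start a new search?"
    )


save_progress("user-1", 3)
print(welcome("user-1", "Sam"))
```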

Audio

Custom audio isn’t widely adopted yet, but it is a powerful differentiator for your VUI. Think about ways to add pre-recorded audio to spice up the rather generic native voices. Think about the buzz TomTom created by adding celebrity voices to their GPS system. It’ll be impossible to book, say, Steven Toast for a few months to read out every possible word with enough variation to keep it interesting, but you can certainly pre-record a welcome message, sign-off or catch-phrase. If your app has a very focused purpose, like in the meter reading example, you can record the entire script as audio, giving your VUI a unique personality.

Wrap up

There are plenty of good articles here on Medium that lay out the structure of a VUI, full of invocations, intents and entities, and I will go into the details of designing a successful conversation in a future article. For now, I hope I’ve given you an idea of what to think of when designing for voice, and hopefully you gained some inspiration too!