Developing Skills for Jibo

Andrew Rapo
December 20, 2016

The Theater of Jibo
Jibo is a unique platform with features that enable a new class of application: social skills. Jibo is built on a familiar mobile device architecture, the NVIDIA Tegra K1, which provides all the capabilities of a traditional smartphone or tablet. But it is Jibo’s form factor and physical animation system that enable his defining social interactions. When designed and implemented correctly, Jibo skills evoke a psychological response that causes humans to attribute a ‘mind’ to this inanimate device. The key to making Jibo seem alive - and capable of social behavior - is animation, in the fullest sense. Jibo’s expressiveness is the product of both cutting-edge technology and the ‘play’ that takes place on the ‘stage’ that his platform provides. Jibo is the lead actor in his own unique theater, and skills are the play.

Choosing Skills for Jibo
The decision to author a new skill for Jibo should be motivated by the degree to which Jibo’s expressive capabilities are required to accomplish the goals of the skill. For example, Jibo will make an excellent motion detector, but he will shine as a watch-bot who can cheerfully recognize a friend or firmly insist that strangers identify themselves. Every skill requires that Jibo play a role that is consistent with his character. The skill designer’s first task is to identify the right part for Jibo to play.

The Three I’s
To best leverage Jibo’s strengths, it is helpful to focus on three key principles:
  • Jibo initiates conversations when he sees you
  • Jibo individualizes his communications
  • Jibo interacts intelligently and in a human-like manner

Jibo’s ability to be aware of people when they are in his presence and to recognize them (by voice and face) allows him to initiate conversations. This unique capability sets Jibo apart from other devices and provides the foundation for his social behaviors. Jibo’s ability to positively identify users allows him to expand his “knowledge base” with information that will allow him to individualize his interactions with each member of the family. This information can include preferences, details about the relationship between family members, or the fact that Mom used the recipe skill to make chicken pot pie. By leveraging initiation and individualization, developers can craft a rich social interaction that cannot be achieved with other voice-driven and flat-screen devices. 

The Script
All Jibo skills start as a script using a format called a K-Script. Like a screenplay, the K-Script describes the setting, actors and actions that make up a skill. And like a screenplay, K-Scripts describe not only the observed actions but also the unobserved (off-stage) ones. The script should capture the essence of the skill’s core interaction and include enough detail to reveal the complexity and scope of this interaction. Even simple skills will have more than one important interaction, and a separate script should be written for each of these. The script does not need to specify every use case and edge case. In fact, the power of the scripting process is that it can quickly produce a sketch that will allow the designer/developer to communicate a vision for the skill.

For example, a hypothetical excerpt from a Recipe skill K-Script (names and dialog invented for illustration) might look like this:
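
    SETTING: The kitchen, early evening. ANDREW walks in; Jibo is idle on the counter.

    JIBO: (turns toward Andrew) Hi Andrew! Are you thinking about dinner?
    ANDREW: Yeah. I need a chicken recipe I can make in under 30 minutes.
    JIBO: (shows three recipe cards on his screen) I found three quick chicken recipes. Want me to read them to you, or would you rather pick one on my screen?
    [OFF-STAGE: Jibo looks up matching recipes and notes Andrew’s request in his knowledge base.]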


For a good article about K-Scripts, see:

K-Scripts: The fastest and most flexible way to articulate a user experience
(http://www.bladekotelly.com/writing-blog/2016/10/6/k-scripts-the-fastest-and-most-flexible-way-to-articulate-a-user-experience)

The Voice User Interface (VUI)
The illusion of social behavior is greatly enhanced by voice communication. For a voice-driven application to be effective, it must be designed from the ground up with the VUI as the primary interface. The VUI is the foundation - the skeleton - of every Jibo skill. A voice interface can be broken down into modules, each of which is designed for a specific interaction that then leads to the next. For Jibo skills, these modules are composed of MIMs and NLU rules.

[Image: DevelopingSkills_Flows.png]

MIMs and NLU Rules
To manage the complexity of dialog with Jibo, each module of dialog - each stage of the conversation - has a focused domain that is defined by a natural language grammar. The latest Jibo SDK provides all the tools necessary to design modular dialog, including Natural Language Understanding rules (NLU rules) for extracting grammatical meaning from a user’s utterances, and Multimodal Interaction Modules (MIMs), which use the rules to make sense of things that users say. Although voice (audio) is the primary mode of communication with Jibo, his inputs also include a touch screen and head-touch sensors. MIMs can leverage all of these modes for each module of dialog.

Implementing the Rules
Each dialog interaction described in a K-Script has a grammatical domain. For example, when Jibo asks a question like, “Do you want me to show you recipes that feature chicken?” the domain is limited to “yes” or “no” answers. The NLU rules need to match all likely expressions of yes or no, including “yup”, “maybe not”, etc. This is such a common scenario that the SDK provides a ready-to-use yes/no rule. But when Jibo asks, “What kind of recipe are you looking for?” the domain is broader. The rules need to match answers (user utterances) like “Something with chicken,” or “Actually, let’s go with beef instead.” In the context of this example module of dialog, the rules can probably ignore utterances about the news or the weather and treat them as exceptions, but that is a decision to be made during the rules-authoring process.
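
Purely as an illustration of what a yes/no rule has to cover - the actual rules are authored in the SDK’s own grammar format, not in TypeScript - a naive matcher might look something like this:

    // Naive sketch, for illustration only -- a real yes/no rule is an NLU grammar
    // authored with the SDK's Rule Editor, not TypeScript code.
    const NO = ['no', 'nope', 'maybe not', 'not now', 'no thanks'];
    const YES = ['yes', 'yeah', 'yup', 'sure', 'ok'];

    function matchYesNo(utterance: string): 'yes' | 'no' | undefined {
      const u = utterance.toLowerCase();
      if (NO.some(phrase => u.includes(phrase))) return 'no';
      if (YES.some(phrase => u.includes(phrase))) return 'yes';
      return undefined; // anything else would be treated as a NoMatch by the MIM
    }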

The SDK’s NLU Rule Editor allows rules to be tested while they are being written.

[Image: DevelopingSkills_Rules.png]

The goal of the NLU system is to extract the semantic meaning from the user’s utterances and make this information available to the skill. A good way to approach the rule-creation process is to imagine all the kinds of natural, conversational utterances that Jibo may need to recognize in the context of the skill. For a recipe skill this process might start with a list of general utterances like:

  • Hey, Jibo. I need help making something for dinner
  • Hey, Jibo. Do you have a recipe for Chicken Pot Pie?
  • Hey, Jibo. I need a chicken recipe that I can make in under 30 minutes
  • Hey, Jibo. I would like to make something with beef.
  • etc.

The meaning of these utterances can be distilled into several discrete pieces of information:

  • An indication that the user is talking about recipes
  • An indication that the user has specified a main ingredient (e.g. chicken)
  • An indication that the user has some time constraints (e.g. 30 minutes)
  • etc.

The NLU system takes the user’s utterances as input and uses the rules to generate an NLU parse that includes the semantic tags identified by the rules and the values for those tags. Starting with a good list of utterances and a list of required semantic values simplifies the rule-writing process.
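
For instance, for the utterance “I need a chicken recipe that I can make in under 30 minutes,” the parse might carry information along these lines (the tag names here are invented for illustration; the actual tags are whatever the skill’s rules define):

    // Hypothetical parse for: "I need a chicken recipe that I can make in under 30 minutes"
    // (tag names are illustrative, not the SDK's actual output format)
    const parse = {
      recipe: true,          // the user is talking about recipes
      ingredient: 'chicken', // the main ingredient that was specified
      maxMinutes: 30         // the time constraint
    };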

Implementing the MIMs
MIMs are used when Jibo needs to prompt the user. For example, if the user says, “Hey, Jibo. I need to pick a recipe for dinner tonight,” Jibo will launch his recipe skill and activate a MIM that presents a follow-up prompt like, “Sure, Andrew, what main ingredient should we use?” The MIM will then match all answers against the rules for this module. So MIMs require a rule file and a set of prompts. For example, a typical Question MIM will have three main prompts:

  • An Entry Prompt - Asks the question
  • A NoInput Prompt - Re-asks the question if no response is heard
  • A NoMatch Prompt - Re-asks the question if an unrecognized utterance is heard, typically with helpful instructions.

A helpful NoMatch prompt might look like: “There are recipes for chicken, beef, pork, and fish, or you can ask for vegetarian recipes. What main ingredient should we use?”
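
As a sketch (the property names below are hypothetical, not the SDK’s actual MIM schema), the prompt set for this “main ingredient” question might be organized like this:

    // Hypothetical prompt set for the "main ingredient" Question MIM;
    // the property names are illustrative, not the SDK's actual schema.
    const ingredientPrompts = {
      entry: 'Sure, Andrew, what main ingredient should we use?',
      noInput: "Sorry, I didn't catch that. What main ingredient should we use?",
      noMatch: 'There are recipes for chicken, beef, pork, and fish, or you can ask ' +
               'for vegetarian recipes. What main ingredient should we use?'
    };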

When all the MIMs that are specified in the K-Script(s) are ready, they are assembled into dialog flows using the Flow Editor.

Link to MIMs post (https://developers.jibo.com/blog/mims)

Flows and the Flow Editor
One of the best ways to design a dialog interaction is with a flow chart tool. Dialog tends to be very stateful, with each MIM (module) acting as a branching point that leads to another MIM. Translating the logic described by a flow chart into code (and then back into flow charts) is a straightforward process, but it can be error-prone. To minimize the need for this kind of manual translation, the Jibo SDK includes the Flow Editor. Using the Flow Editor, dialog can be described visually and then automatically translated into JavaScript.

The Flow Editor generates flows (.flow files), which can be used to describe a complete dialog interaction or to encapsulate a reusable section of dialog. A flow can be invoked as a sub-flow from within another flow, making flows a powerful way to modularize the implementation of a skill.

Although flows can be used to implement any logic required by a skill, the best practice is to keep as much logic as possible in JavaScript (TypeScript) classes, and use flows only when MIMs are involved.
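
For example, the recipe-filtering logic might live in a plain TypeScript class like the hypothetical one below, so that the flows only have to hand it the values extracted by the MIMs:

    // Hypothetical recipe-filtering logic kept in a plain TypeScript class,
    // outside of any flow. The class and its API are invented for illustration.
    interface Recipe {
      name: string;
      mainIngredient: string;
      minutes: number;
    }

    class RecipeCatalog {
      constructor(private recipes: Recipe[]) {}

      // Called from a flow once the ingredient (and optional time limit) are known.
      find(ingredient: string, maxMinutes?: number): Recipe[] {
        return this.recipes.filter(r =>
          r.mainIngredient === ingredient &&
          (maxMinutes === undefined || r.minutes <= maxMinutes));
      }
    }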

Link to Flows Post (https://developers.jibo.com/blog/flows-a-dungeon-example)

Character AI™
To ensure that Jibo moves and behaves in a lifelike and consistent way no matter what skill is running, the SDK includes Character AI features like embodied speech. The embodied speech system automatically turns text prompts into compelling performances by dynamically creating body and screen animations that match the timing and meaning of the text. These animations are synchronized with the text-to-speech (TTS) system that renders Jibo’s voice. Embodied speech can be fully automatic, but developers also have the option to craft a performance by including embodied speech markup tags (ESML tags) in the prompt text. These tags can be used to insert custom animations, sound effects, and timing into the performance. For example, this prompt includes an ESML tag that adds a sound effect:

<sfx name='SSA.embarrassed.01'/>Maybe I'm still groggy from the trip. Will you tap yes or no on my screen, to tell me if you're ${owner}?

In this next example, the <anim> tag allows this prompt to play in parallel with a “celebration” body animation:

<anim path='animation/Celebrate_07.keys' nonBlocking='true'/>Would you like to play a game?

Embodied speech is one of many Character AI modules that automatically handle the more complex aspects of Jibo’s performances.

The Team
A skill development team will look a lot like a mobile game development team with the addition of a VUI Designer and a Speech Engineer. Like many game teams, the skill team will very often also need a writer and an animator for both 2D and 3D animation. For small teams, individuals will likely wear multiple hats. Some of the key roles (or areas of expertise) on a skill development team include:

  • Producer
    • The ‘director’ of the skill
    • Keeper of the creative vision
    • Responsible for executing the skill according to the shared vision
    • Responsible for budget, schedule, planning and delivery
  • Engineer
    • The technical architect of the skill
    • Responsible for programming the skill
  • Writer
    • Responsible for writing dialog, scripts, expressing Jibo’s character/voice
  • VUI Designer
    • Responsible for the voice user interface
  • Speech Engineer
    • Responsible for leveraging speech technology, creating NLU rules
  • Designer/Artist
    • Responsible for the user experience, visuals, look and feel, GUI
  • Animator
    • Responsible for 2D and 3D animation
  • Technical Artist (artist/programmer)
    • Responsible for integrating and optimizing media assets
    • Rapid prototyping of design ideas

Process
A good process for developing a skill will include milestones for each of the key layers discussed above.

Milestone 1 - The VUI Skeleton
In the lifecycle of a skill, the first major milestone is the VUI skeleton, a complete VUI-only implementation of the skill’s navigation with functional rules, MIMs and flows. Nothing beyond the VUI navigation needs to be functional at this point.

Once the VUI skeleton is ready (approved), the skill can be ‘fleshed out’ with GUI features, API calls, animation, audio, application logic, etc.

Milestone 2 - The Functional Alpha
The Functional Alpha milestone should include representative examples of all of the skill’s main features. For example, for a recipe skill this would include recipe selection, presentation of ingredients, VUI and GUI navigation through recipe instructions, presentation of media (video, audio, photos), body animation, sound effects, and core application logic.

Milestone 3 - The Social Beta
The Social Beta milestone should make use of the KnowledgeBase, PersonID, and integration with Jibo’s core Greetings skill (via the KnowledgeBase) to enable proactive, personalized follow-up. The Social Beta should also exhibit all of the expressive social cues and ‘theatrical’ presentation that the skill requires.

Milestone 4 - The Connected Release Candidate
The final milestone in the lifecycle of a skill is a release candidate that includes any special account integrations, messaging integrations, sharing integrations and analytics integrations. Release candidates should be thoroughly QA’d and ready for final acceptance and validation.

Usability Testing
At each milestone, appropriate usability testing should be conducted so that feedback can be incorporated into the next deliverable. This is important because with a new technology like Jibo, many design decisions - especially with the VUI - will be best guesses. Often, an interaction that seems clear and straightforward to the designer/developer will turn out to be unintuitive and confusing to the user. In addition to testing designs, the usability process tends to inspire innovative solutions.

QA
Jibo skills present some unique challenges for QA testing. Unlike mobile app interactions, which are designed to be 100% the same from session to session, Jibo skills are designed to have variability in every aspect, including: timing of body animations, permutations of TTS prompts, Character AI (algorithmic/statistical) decision-making about when to initiate and what dialog to prioritize, etc. By design, no two interaction sessions with Jibo will proceed in exactly the same way. In many cases it will be valuable for the skill developer to incorporate a mechanism that allows the statistical decision-making to be locked down - to make thorough regression testing possible. The SDK will include generalized tools and examples to make testing easier.
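
One simple way to provide such a lock-down mechanism - shown below as an illustrative approach, not an SDK feature - is to route all of the skill’s random choices through a single seedable generator:

    // Minimal seedable pseudo-random generator (mulberry32), so that a skill's
    // "random" choices can be replayed as a known sequence during regression tests.
    // This is an illustrative technique, not part of the Jibo SDK.
    function makeRandom(seed: number): () => number {
      return function () {
        seed = (seed + 0x6D2B79F5) | 0;
        let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
        t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }

    // In production: makeRandom(Date.now()); in a regression test: makeRandom(42).
    const random = makeRandom(42);
    const prompts = ['Want to hear a recipe?', 'Shall I read the first recipe?'];
    const prompt = prompts[Math.floor(random() * prompts.length)];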

The Art of Skill Development
The art of developing Jibo skills is still in its early days and will evolve rapidly. This post shares some of the insights and best practices that have been useful so far. As the Jibo developer community grows, it will be exciting to see what new ideas and techniques emerge.

Andrew Rapo
Executive Producer, Business Development & Marketing

Become a Jibo developer