Andrew Rapo
October 12, 2016

Jibo Skills are built around voice interactions. We refer to the Voice User Interface as the VUI. The goal of VUI development is to enable dialog. In the early versions of the Jibo SDK, the VUI toolkit included the Listen behavior (speech recognition) and the TTS behavior (text-to-speech). With these simple building blocks it is possible to create rich dialog experiences, but it is a lot of work. To simulate natural dialog, there must be mechanisms for introducing variability, handling errors, and coordinating VUI and GUI interactions. To remove this burden from developers and to provide consistency across skills, we have created MIMs.

MIMs are Multimodal Interaction Modules. As the name implies, MIMs handle interactions that have more than one mode, including voice (VUI) and touchscreen (GUI) modes. When prompted with a question like “Do you want to share this photo?” the expectation is that the user can answer either by saying “yes” or by tapping a button on Jibo’s touchscreen. Implementing this without MIMs is hard; MIMs make it easy.

Anatomy of a MIM

The MIM Behavior is a complex state machine that is configured using JSON in the form of a .mim file. The .mim file determines:

  • The MIM’s type: Question, Announcement, Optional Response
  • The speech recognition rule used to parse the user’s utterances
  • The main (Entry-Core) TTS prompt(s)
  • Prompts for use when Jibo does not hear a response (NoInput)
  • Prompts for use when Jibo hears something but cannot understand it (NoMatch)
  • Sample utterances, which are used for automatically-generated GUI controls
  • And more.

The raw .mim file looks like this:

{
	"mim_type": "question",
	"rule_name": "rules/en-us/lights/YesNo.fst",
	"sample_utterances": "yes,no",
	"timeout": 6,
	"num_tries_for_gui": 1,
	"es_auto_tagging": true,
	"prompts": [
		{
			"prompt_category": "Entry-Core",
			"prompt_sub_category": "Q",
			"index": 1,
			"condition": "",
			"prompt": "The lights in the ${rooms} are on. Would you like me to turn them off?",
			"media": "TTS",
			"prompt_id": "Entry-1"
		},
		{
			"prompt_category": "Errors",
			"prompt_sub_category": "NI",
			"index": 1,
			"condition": "",
			"prompt": "So, would you like me to turn off the lights in the ${room}?",
			"media": "TTS",
			"prompt_id": "NoInput-1"
		},
		{
			"prompt_category": "Errors",
			"prompt_sub_category": "NM",
			"index": 1,
			"condition": "",
			"prompt": "Sorry, ${user} I didn't get that. Would you like me to turn the ${room} lights off?",
			"media": "TTS",
			"prompt_id": "NoMatch-1"
		}
	]
}


Editing .mim files directly is an option, but to make life easier we have created a MIM Editor and integrated it into the SDK’s Atom package. The MIM Editor looks like this:

Variability: Multiple Prompts and Conditions

To make dialog feel natural, MIMs can be configured with any number of prompts. When the MIM Behavior executes, it picks one of the available prompts from the appropriate category. The prompt can be chosen randomly or using logic supplied in the Condition field. 

For example, multiple Entry-Core prompts can be defined with Condition fields that filter them based on available data. In this example there are two Entry-Core prompts that are appropriate for situations where Jibo has not identified the user (Entry-1 and Entry-3). In these cases he will choose one of the two randomly. However, if the user is identified, Jibo will use the Entry-2 prompt. Because an arbitrary number of prompt and condition pairs can be defined for any MIM, the developer has a straightforward way to add plenty of prompt variability to dialog interactions.
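
The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the MIM Behavior's actual implementation: the Entry-2 and Entry-3 prompt texts are invented, conditions are modeled as Python callables rather than the Condition field's real syntax, and choose_prompt is a hypothetical helper name.

```python
import random

# Hypothetical prompt records modeled on the "prompts" array of a .mim file.
# Assumption: conditions are plain Python callables; None means "always eligible".
PROMPTS = [
    {
        "prompt_id": "Entry-1",
        "prompt_category": "Entry-Core",
        "condition": lambda ctx: ctx.get("user") is None,
        "prompt": "The lights in the ${rooms} are on. Would you like me to turn them off?",
    },
    {
        "prompt_id": "Entry-2",  # invented text, for illustration only
        "prompt_category": "Entry-Core",
        "condition": lambda ctx: ctx.get("user") is not None,
        "prompt": "${user}, the lights in the ${rooms} are on. Should I turn them off?",
    },
    {
        "prompt_id": "Entry-3",  # invented text, for illustration only
        "prompt_category": "Entry-Core",
        "condition": lambda ctx: ctx.get("user") is None,
        "prompt": "The ${rooms} lights are still on. Want me to turn them off?",
    },
]


def choose_prompt(prompts, category, context, rng=random):
    """Pick one prompt at random from those in `category` whose condition passes."""
    candidates = [
        p for p in prompts
        if p["prompt_category"] == category
        and (p["condition"] is None or p["condition"](context))
    ]
    if not candidates:
        raise LookupError(f"no eligible prompt in category {category!r}")
    return rng.choice(candidates)
```

With no identified user, repeated calls return Entry-1 or Entry-3 at random; with a user in the context, only Entry-2 is eligible.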

Multiple Modes: VUI + GUI

By defining sample utterances in the MIM configuration, Jibo can automatically generate Graphical User Interface (GUI) controls that can be used as an alternative to voice interaction. The CheckLights MIM (above) uses a YesNo rule to process the user’s utterances. By providing “yes” and “no” as sample utterances, Jibo is able to present Yes and No touchscreen buttons on his screen. Tapping them produces the same result as saying “yes” or “no.” 

In this example, the Failures to Trigger GUI field is set to 1. This tells Jibo to present the GUI only after a NoMatch error, which indicates that the user is having trouble communicating with Jibo via voice. When this value is set to 0, Jibo always presents the GUI.
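
As a rough sketch of this behavior, the Python below derives button labels from the sample utterances and decides when the GUI should appear. The function names are hypothetical, and the comparison rule (show the GUI once the NoMatch count reaches the threshold) is an assumption drawn from the description above, not taken from the SDK.

```python
def buttons_from_samples(sample_utterances):
    """Turn a comma-separated sample string such as "yes,no" into button labels."""
    return [s.strip().capitalize() for s in sample_utterances.split(",") if s.strip()]


def should_show_gui(no_match_count, failures_to_trigger):
    """Assumed rule: show the GUI once NoMatch errors reach the threshold.

    A threshold of 0 therefore means the GUI is always shown.
    """
    if failures_to_trigger < 0:
        raise ValueError("failures_to_trigger must be 0 or greater")
    return no_match_count >= failures_to_trigger
```

For the CheckLights configuration, buttons_from_samples("yes,no") yields the two labels, and should_show_gui becomes true after the first NoMatch error.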


Like the Listen Behavior, MIMs are configured with a speech recognition rule to parse the user’s utterances. When the cloud-based automatic speech recognition (ASR) system returns a transcript of the user’s utterance, this transcript is passed to Jibo’s on-board Natural Language Understanding (NLU) parser. To determine the semantic meaning of the utterance, the parser looks for patterns that are defined in a .rule file. For example, a simple YesNo.rule file might look like this:

TopRule = ($* $CONTROL $*) {intent = CONTROL._intent};

CONTROL =
                $YES {_intent=YES._intent} |
                $NO {_intent=NO._intent}

YES =
                yes  |
                yeah |
                sure |
                yep |
                certainly |
                absolutely |
                definitely |
                ( i (think | suppose | guess) so) |
                ( i do ) |
                okay |
                fine |
                please |
                (go ahead) |
                right

NO =
                no  |
                nope |
                not |
                don\'t |
                ( do not ) |
                ( i\'m good )

This example rule looks for different ways a user might say “yes” or “no,” including “right,” “sure,” “nope,” and “I’m good.” Rule files use a special syntax that requires more explanation than this post can provide (see https://developers.jibo.com/sdk/docs/reference/jibo-atom-package/speech-recognition.html for more detail). In this simple case, the desired output is a variable named “intent” that is set to either “yes” or “no.” This can be seen in the first few lines:

TopRule = ($* $CONTROL $*) {intent = CONTROL._intent};

CONTROL =
                $YES {_intent=YES._intent} |
                $NO {_intent=NO._intent}

In this rule, a CONTROL subrule is defined with two additional subrules, YES and NO, which define the patterns the rule is trying to detect. The TopRule establishes that the CONTROL phrase(s) can have anything before or after them via the wildcard symbol ($*). Then the patterns for each CONTROL are defined. When a control pattern is recognized, the NLU parser returns a JSON object containing the results, like this:

{
        "Input": "i'm good",
        "NLParse": {
            "intent": "no"
        },
        "heuristic_score": 9,
        "index": 0
}

The MIM provides access to this result, which can then be used to determine what happens next. The best way to get familiar with rules is to try them in the SDK’s interactive Rule Editor, which looks like this:

When you type a phrase into the input field in the right pane (e.g., “i’m good”), the result object is displayed. VUI interactions are the core of every skill, and good speech rules are the key to successful VUI interactions. The interactive Rule Editor can help every skill developer master the art of writing rules.
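
To make the matching behavior concrete outside the Rule Editor, here is a small Python approximation of the YesNo rule using regular expressions. It is not Jibo's NLU parser: the phrase lists are taken from the rule above, the result omits fields such as heuristic_score, and checking NO phrases before YES phrases is an arbitrary tie-break chosen for this sketch so that “I do not” resolves to “no.”

```python
import re

YES_PHRASES = [
    "yes", "yeah", "sure", "yep", "certainly", "absolutely", "definitely",
    "i think so", "i suppose so", "i guess so", "i do",
    "okay", "fine", "please", "go ahead",
]
NO_PHRASES = ["no", "nope", "not", "don't", "do not", "i'm good"]


def _compile(phrases):
    # \b keeps "no" from matching inside "nope" or "nothing".
    return re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")


YES_PATTERN = _compile(YES_PHRASES)
NO_PATTERN = _compile(NO_PHRASES)


def parse(utterance):
    """Return a result shaped like the parser output, with an empty NLParse on no match."""
    text = utterance.lower().replace("\u2019", "'").strip()
    parsed = {}
    # Assumption: NO wins when both patterns match (e.g. "i do not").
    if NO_PATTERN.search(text):
        parsed["intent"] = "no"
    elif YES_PATTERN.search(text):
        parsed["intent"] = "yes"
    return {"Input": text, "NLParse": parsed}
```

Because the pattern is searched anywhere in the utterance, surrounding words are ignored, which mirrors the role of the $* wildcards in the TopRule.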

Error Handling

Just like in human-to-human dialog, voice interactions with Jibo will sometimes include misunderstandings and recognition errors. MIMs recover from these errors as naturally as possible. As mentioned above, MIMs define a way for Jibo to respond when he asks a question but can’t make sense of the answer. The NoInput and NoMatch prompts give Jibo appropriate ways to re-ask or rephrase the question. Using the CheckLights MIM, an interaction might go like this:

User: Hey, Jibo. I am heading out for a while.  See you later.

Jibo: <Entry-1> The lights in the living room are on. Would you like me to turn them off?

User: [Thinking, but no answer]

Jibo: <NoInput-1> So, would you like me to turn off the lights in the living room?

User: [dog barks over user’s “Yes” response.]

Jibo: <NoMatch-1> Sorry, Cynthia I didn't get that. Would you like me to turn the living room lights off?

User: Yes, please. See you later.
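
The recovery flow in this exchange can be simulated with a short Python sketch. It uses the three CheckLights prompts and fills the ${...} variables with Python's string.Template. The yes/no detection is deliberately simplistic, and the max_tries limit and the function names are assumptions made for the illustration, not part of the MIM Behavior.

```python
import re
from string import Template

# Prompt texts from the CheckLights .mim file, keyed by prompt_id.
PROMPTS = {
    "Entry-1": "The lights in the ${rooms} are on. Would you like me to turn them off?",
    "NoInput-1": "So, would you like me to turn off the lights in the ${room}?",
    "NoMatch-1": "Sorry, ${user} I didn't get that. Would you like me to turn the ${room} lights off?",
}


def render(prompt_id, variables):
    """Fill the ${...} placeholders of a prompt."""
    return Template(PROMPTS[prompt_id]).substitute(variables)


def classify(heard):
    """None means silence (NoInput); unrecognized speech is NoMatch."""
    if heard is None:
        return "NoInput"
    match = re.search(r"\b(yes|no)\b", heard.lower())
    return match.group(1) if match else "NoMatch"


def run_question(heard_sequence, variables, max_tries=3):
    """Return (answer or None, list of everything Jibo said)."""
    spoken = [render("Entry-1", variables)]
    for attempt, heard in enumerate(heard_sequence[:max_tries], start=1):
        outcome = classify(heard)
        if outcome in ("yes", "no"):
            return outcome, spoken
        if attempt < max_tries:
            # Re-ask with the prompt that matches the kind of error.
            spoken.append(render(outcome + "-1", variables))
    return None, spoken
```

Feeding it silence, then an unintelligible sound, then “Yes, please. See you later.” reproduces the Entry-1, NoInput-1, NoMatch-1 sequence shown in the dialog.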

MIMs also provide global VUI controls that give the user an opportunity to cancel an interaction or ask Jibo to repeat a question:

User: Hey Jibo

Jibo: Hey, Roberto. Do you want to play a quick trivia game?

User: Sure. That sounds fun.

Jibo: What fourth largest island in the world shares its name with a film that both Chris Rock and Jada Pinkett Smith have credits in?

User: Whoa. I am going to need you to repeat that, please.

Jibo: Sure. Here goes: What fourth largest island in the world...

[door bell rings]

User: Cancel that.

Jibo: No problem, Roberto.
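
One way to picture these global controls is as a check that runs before the skill's own rule sees the utterance. The Python sketch below follows that idea. The trigger phrases, the reply wording, and the handle_utterance name are all assumptions for illustration; the post does not describe how the SDK implements this.

```python
import re

# Assumed trigger phrases; the real global rules are not shown in this post.
GLOBAL_COMMANDS = {
    "repeat": re.compile(r"\b(?:repeat|say that again)\b"),
    "cancel": re.compile(r"\b(?:cancel|never mind)\b"),
}


def handle_utterance(utterance, current_prompt, parse_answer):
    """Check global commands first, then fall back to the skill's own parser."""
    text = utterance.lower()
    for name, pattern in GLOBAL_COMMANDS.items():
        if pattern.search(text):
            if name == "repeat":
                return ("repeat", "Sure. Here goes: " + current_prompt)
            return ("cancel", "No problem.")
    return ("answer", parse_answer(text))
```

Here parse_answer stands in for whatever rule the MIM is configured with, so the same wrapper would work for any question.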

Embodied Speech Markup

The most important benefit of MIMs is that they make it easy to use all of Jibo’s expressive capabilities. In addition to controlling TTS prompts and the GUI, MIMs provide a way to synchronize body animations and sound effects with prompts. We refer to this as Embodied Speech. For example, this prompt uses Embodied Speech Markup Language (ESML) tags to include a sound effect:

<sfx name='SSA.embarrassed.01'/>Maybe I'm still groggy from the trip. Will you tap yes or no on my screen, to tell me if you're ${owner}?

In this next example, the <anim> tag allows this prompt to play in parallel with a body animation:

<anim path='animation/Celebrate_07.keys' nonBlocking='true'/>Would you like to play a game?

ESML can also be used to control the prosody of prompts (the patterns of stress, intonation, and timing). This example shows how to use the <break> tag to control timing in conjunction with some advanced <anim> tags (‘size’ refers to time in seconds).

Hi Nancy, <break size='0.5'/><anim cat='thinking' filter='processing'> looks like you have a busy morning.</anim> You have meetings at 9 30, <break size='0.5'/> and 11. <break size='0.7'/> <anim cat='emoji' filter='fireworks' nonBlocking='true'/> Don’t forget, the 4th of July fireworks are at 9 tonight.
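
To see what the markup contributes, the following Python sketch pulls the <break> durations out of the prompt above and strips all tags to recover the words Jibo would speak. It is a regex-based illustration only, not Jibo's ESML parser, and it assumes tags are written exactly in the single-quoted attribute style used in these examples.

```python
import re

# Matches tags such as <break size='0.5'/> and captures the duration.
BREAK_TAG = re.compile(r"<break\s+size='([0-9]*\.?[0-9]+)'\s*/>")
# Matches any opening, closing, or self-closing tag.
ANY_TAG = re.compile(r"</?[a-zA-Z][^>]*>")

ESML_PROMPT = (
    "Hi Nancy, <break size='0.5'/><anim cat='thinking' filter='processing'> "
    "looks like you have a busy morning.</anim> You have meetings at 9 30, "
    "<break size='0.5'/> and 11. <break size='0.7'/> "
    "<anim cat='emoji' filter='fireworks' nonBlocking='true'/> "
    "Don\u2019t forget, the 4th of July fireworks are at 9 tonight."
)


def total_break_seconds(esml):
    """Sum the 'size' values (seconds) of all <break> tags."""
    return sum(float(size) for size in BREAK_TAG.findall(esml))


def spoken_text(esml):
    """Remove all tags and collapse the leftover whitespace."""
    return " ".join(ANY_TAG.sub("", esml).split())
```

For this prompt the three breaks add up to 1.7 seconds of pauses, and the stripped text is the plain sentence sequence that TTS would read.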

We are excited about MIMs because they dramatically reduce the effort required to make rich dialog interactions. Combined with Embodied Speech, developers can easily turn simple dialog prompts into expressive Jibo performances.

Questions/Comments? Feel free to join our forum discussion here.

Andrew Rapo
Executive Producer, Business Development & Marketing