Recently, the hardware of domestic service robots (DSRs) has become increasingly standardized, and many studies have been conducted on them. However, the communication ability of most DSRs is still very limited. Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; it is therefore unclear whether the predicted action is physically executable.
In this work, we introduce a grounded instruction understanding method that estimates appropriate objects given an instruction and a situation, using a Generative Adversarial Nets (GAN)-based classifier operating on latent representations.
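To make the classifier structure concrete, the following is a minimal sketch (in PyTorch; not the implementation used in this work) of a GAN-based classifier over latent representations: a generator produces fake latent vectors, while a discriminator assigns each latent vector to one of the object classes or to an additional ``fake'' class, in the spirit of semi-supervised GAN classifiers. All layer sizes and names are illustrative assumptions.
\begin{verbatim}
# Minimal sketch (PyTorch) of a GAN-based classifier over latent
# representations.  Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 128   # dimensionality of the fused latent representation (assumed)
NUM_CLASSES = 10   # number of candidate objects (assumed)

class Generator(nn.Module):
    """Maps noise to fake latent representations."""
    def __init__(self, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM))
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Classifies a latent vector into K object classes plus one 'fake' class."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES + 1))
    def forward(self, x):
        return self.net(x)  # logits; a softmax yields class likelihoods
\end{verbatim}
In such a setup, real latent representations extracted from the instruction and scene would be labeled with their object class, generated samples would be labeled fake, and at inference time the softmax over the K object classes gives the likelihood of each candidate object.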
In this work, we address the case where the target area is not specified. For instance, ``Put away the milk and cereal.'' is a natural instruction whose target area is ambiguous in daily life environments. Conventionally, such an instruction can be disambiguated through a dialogue system, but at the cost of time and user burden. Instead, we propose a multimodal language understanding (MLU) approach in which instructions are disambiguated from the robot state and the environment context. We developed the MultiModal Classifier Generative Adversarial Network (MMC-GAN), which uses deep learning to predict the likelihood of each candidate target area, taking into account task feasibility with respect to the HSR's physical limitations and the space available on (i.e., the clutter of) the different target areas.
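The sketch below illustrates the kind of multimodal fusion involved; it is not the MMC-GAN implementation. A linguistic feature of the instruction and a scene/robot-state feature are concatenated and classified into candidate target areas; all dimensions, feature extractors, and variable names are assumed placeholders.
\begin{verbatim}
# Illustrative sketch of multimodal fusion for target-area prediction
# (not the authors' MMC-GAN code).  Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM, SCENE_DIM, NUM_AREAS = 300, 512, 5  # assumed sizes

class TargetAreaClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(TEXT_DIM + SCENE_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_AREAS))
    def forward(self, text_feat, scene_feat):
        x = torch.cat([text_feat, scene_feat], dim=-1)
        return F.softmax(self.fuse(x), dim=-1)  # likelihood per target area

# Usage: pick the most likely area for "Put away the milk and cereal."
model = TargetAreaClassifier()
text_feat = torch.randn(1, TEXT_DIM)    # placeholder sentence embedding
scene_feat = torch.randn(1, SCENE_DIM)  # placeholder clutter/reachability features
probs = model(text_feat, scene_feat)
best_area = probs.argmax(dim=-1)
\end{verbatim}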
Our first MLU method, LCore, fuses motion, visual, and linguistic inputs for manipulation tasks in an artificial environment. LCore generates motions and utterances in an object manipulation dialogue task by integrating belief modules for speech, vision, and motion into a probabilistic framework, so that a user's utterances can be understood based on multimodal information. Responses to the utterances are optimized based on an integrated confidence measure function over the integrated belief modules.
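As a rough illustration of such an integrated confidence measure (not LCore's actual formulation), the sketch below combines per-module log-likelihoods with weights and either executes the best-scoring motion or asks a clarifying question when confidence is low; the weights, threshold, and candidate structure are assumptions made for illustration.
\begin{verbatim}
# Hedged sketch of integrating speech, vision, and motion beliefs.
# Weights, threshold, and candidate structure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    action: str
    speech_ll: float   # log-likelihood from the speech belief module
    vision_ll: float   # log-likelihood from the vision belief module
    motion_ll: float   # log-likelihood from the motion belief module

def integrated_confidence(c, w_speech=1.0, w_vision=1.0, w_motion=1.0):
    """Weighted sum of module log-likelihoods (illustrative)."""
    return w_speech * c.speech_ll + w_vision * c.vision_ll + w_motion * c.motion_ll

def select_response(candidates, threshold=-5.0):
    """Execute the best candidate if confident enough, otherwise ask back."""
    best = max(candidates, key=integrated_confidence)
    if integrated_confidence(best) < threshold:
        return ("utterance", "Which object do you mean?")
    return ("motion", best.action)
\end{verbatim}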