Enhancing robot voice by implementing text-to-speech solutions

A core element of human-robot interaction is verbal communication between humans and robots, for which suitable text-to-speech (TTS) solutions are key. This bachelor thesis gives insights into suitable TTS solutions for a humanoid social robot.

Filip Bosankic, 2021

Bachelor Thesis, Institute for Information Systems, HSW FHNW
Supervisors: Vivienne Jia Zhong, Theresa Schmiedel
Keywords: Text-to-speech, TTS solutions, Implementation, Evaluation, Mean opinion score
The Institute for Information Systems of the FHNW School of Business owns several humanoid social robots, such as Pepper and Joey. These robots use different built-in text-to-speech systems, each of which offers only one voice. This work aims to find a new TTS system that offers a better German-language voice for the robot Joey, in both a male and a female variant. Identifying and implementing a more suitable voice could enhance the experience of people interacting with Joey.
This bachelor thesis was conducted in three phases. Phase one identified the most relevant literature and introduced the subject. Phase two documented and summarized the requirements for a suitable TTS solution and surveyed the market for open-source, off-the-shelf TTS solutions in order to identify the two most suitable ones through a structured evaluation. In the process, unsuitable solutions were excluded and the selection for implementation was justified. In phase three, the two TTS solutions were implemented, tested, and evaluated through an online survey (n=53).
The market overview showed that only a few German open-source TTS solutions are available. Nevertheless, the implementation of the two most suitable TTS solutions, and how they are perceived, could be demonstrated. The first implemented solution can generate a male and a female voice without additional interfaces or an internet connection. The second solution is based on deep learning and fulfills almost all requirements better than all other solutions. However, it can only generate a male voice and requires a locally running server, so it was not evaluated further in the online survey. As a final result, a male and a female voice generated with the first implemented TTS solution were evaluated in comparison with the TTS voice currently used by the robot Joey. Based on the survey results, there is not yet an open-source, off-the-shelf solution on the market that can generate both male and female voices that potential users perceive as better than the existing solution used by the robot Joey.
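
As an illustration of the mean opinion score named in the keywords, the following minimal sketch shows how survey ratings for several voices could be aggregated. It is not the evaluation script used in the thesis. The 1 to 5 rating scale, the function name mean_opinion_score, the normal-approximation confidence interval, and the placeholder ratings are all assumptions made for this example; the numbers are not survey results.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings, scale=(1, 5)):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    The MOS is the arithmetic mean of all ratings for one voice.
    The 1-5 scale is an assumption, not taken from the thesis.
    """
    low, high = scale
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside the scale {scale}")

    mos = mean(ratings)
    # Normal approximation; with a single rating no spread can be estimated.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Placeholder ratings for illustration only (NOT data from the survey).
placeholder_ratings = {
    "voice_a": [4, 3, 5, 4, 2],
    "voice_b": [3, 3, 4, 2, 3],
}

for voice, ratings in placeholder_ratings.items():
    mos, half_width = mean_opinion_score(ratings)
    print(f"{voice}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing the resulting scores per voice, together with their confidence intervals, is one simple way to judge whether a candidate voice is perceived as better than the existing one.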
