Build a Locally Running Voice Assistant | by Sébastien Gilbert | Dec, 2023

Ask an LLM a question without leaking private data

Sébastien Gilbert

Towards Data Science
Image generated by the author, with help from

I have to admit that I was initially skeptical about the ability of Large Language Models (LLMs) to generate code snippets that actually worked. I tried it expecting the worst, and I was pleasantly surprised. As with any interaction with a chatbot, the way the question is formatted matters, but with time, you learn how to specify the boundaries of the problem you need help with.

I was getting used to having an online chatbot service always available while writing code when my employer issued a company-wide policy prohibiting employees from using it. I could go back to my old googling habits, but I decided to build a locally running LLM service that I could question without leaking information outside the company walls. Thanks to the open-source LLM offering on HuggingFace, and the chainlit project, I could put together a service that satisfies the need for coding assistance.

The next logical step was to add some voice interaction. Although voice is not well-suited for coding assistance (you want to see the generated code snippets, not hear them), there are situations where you need help with inspiration on a creative project. The feeling of being told a story adds value to the experience. On the other hand, you may be reluctant to use an online service because you want to keep your work private.

In this project, I’ll take you through the steps to build an assistant that allows you to interact vocally with an open-source LLM. All the components run locally on your computer.

The architecture involves three separate components:

  • A wake-word detection service
  • A voice assistant service
  • A chat service
Flowchart of the three components. Image by the author.

The three components are standalone projects, each having its own GitHub repository. Let’s walk through each component and see how they interact.

Chat service

The chat service runs the open-source LLM called HuggingFaceH4/zephyr-7b-alpha. The service receives a prompt through a POST call, passes the prompt through the LLM, and returns the output as the call response.
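As a rough illustration of this request/response exchange, the sketch below builds and sends a POST request with Python's standard library. The URL, the port, and the JSON field names (`prompt`, `response`) are assumptions made for the example; the actual endpoint and payload format are defined in the chat service repository.

```python
import json
import urllib.request

# Hypothetical endpoint: the real address and route come from the chat service configuration.
CHAT_SERVICE_URL = "http://localhost:5000/chat"


def build_chat_request(prompt: str, url: str = CHAT_SERVICE_URL) -> urllib.request.Request:
    """Build a POST request carrying the prompt as a JSON body."""
    # "prompt" is an assumed field name, not taken from the project.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_llm(prompt: str, url: str = CHAT_SERVICE_URL, timeout: float = 120.0) -> str:
    """Send the prompt to the chat service and return the generated text."""
    request = build_chat_request(prompt, url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        payload = json.loads(reply.read().decode("utf-8"))
    # "response" is an assumed field name, not taken from the project.
    return payload["response"]
```

A generous timeout is used because a 7-billion-parameter model running locally can take a while to produce an answer.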

You can find the code here.

In …/chat_service/server/, rename chat_server_config.xml.example to chat_server_config.xml.

You can then start the chat server with the following command:


When the service runs for the first time, it takes several minutes to start because large files get downloaded from the HuggingFace website and stored in a local cache directory.

You get a confirmation from the terminal that the service is running:

Confirmation that the chat service is running. Image by the author.

If you want to test the interaction with the LLM, go to …/chat_service/chainlit_interface/.

Rename app_config.xml.example to app_config.xml. Launch the web chat service with

Browse to the local address localhost:8000.

You should be able to interact with your locally running LLM through a text interface:

Text interaction with the locally running LLM. Image by the author.

Voice assistant service

The voice assistant service is where the speech-to-text and text-to-speech conversions happen. You can find the code here.

Go to …/voice_assistant/server/.

Rename voice_assistant_service_config.xml.example to voice_assistant_service_config.xml.

The assistant starts by playing the greeting to indicate that it’s listening to the user. The greeting text is configured in voice_assistant_config.xml, under the element <welcome_message>:

The voice_assistant_config.xml file. Picture by the writer.
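To make the role of the configuration file concrete, here is a small sketch that reads such a file with Python's standard library. The element names <welcome_message>, <end_of_conversation_text>, and <gibberish_prefix_list> come from this article; the root element name, the <prefix> children, and the sample values are assumptions, and the real file may be laid out differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Sample content for illustration only.
# The <config> root and the <prefix> children are assumptions.
SAMPLE_CONFIG = """
<config>
    <welcome_message>Hello! How can I help you?</welcome_message>
    <end_of_conversation_text>Thank you and goodbye</end_of_conversation_text>
    <gibberish_prefix_list>
        <prefix>[</prefix>
        <prefix>i'm going to</prefix>
    </gibberish_prefix_list>
</config>
"""


@dataclass
class AssistantConfig:
    welcome_message: str
    end_of_conversation_text: str
    gibberish_prefixes: List[str]


def parse_config(xml_text: str) -> AssistantConfig:
    """Extract the greeting, the end-of-conversation phrase, and the gibberish prefixes."""
    root = ET.fromstring(xml_text.strip())
    prefixes = [
        (node.text or "").strip()
        for node in root.findall("./gibberish_prefix_list/prefix")
    ]
    return AssistantConfig(
        welcome_message=root.findtext("welcome_message", default="").strip(),
        end_of_conversation_text=root.findtext("end_of_conversation_text", default="").strip(),
        gibberish_prefixes=prefixes,
    )
```

Keeping these phrases in a configuration file means the greeting or the end-of-conversation phrase can be changed without touching the code.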

The text-to-speech engine is pyttsx3. It converts text into spoken audio that you can hear through your audio output device. In my experience, this engine speaks with a reasonably natural tone, both in English and in French. Unlike other packages that rely on an API call, it runs locally.
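For readers who have not used pyttsx3, a minimal sketch of speaking one sentence could look like the following. The `speak` helper and its optional `engine` parameter are hypothetical names introduced for this example (the parameter makes the function easy to exercise without audio hardware); they are not taken from the project's code.

```python
def speak(text: str, engine=None) -> None:
    """Speak the text aloud and block until playback is finished."""
    if engine is None:
        # Third-party package, imported lazily so the helper can be
        # exercised with a stand-in engine when pyttsx3 is not installed.
        import pyttsx3

        engine = pyttsx3.init()
    engine.say(text)      # queue the sentence
    engine.runAndWait()   # process the queue and return once speech is done


if __name__ == "__main__":
    speak("Hello! How can I help you?")
```

In a long-running service, you would probably create the engine once and pass it to every call rather than initializing it each time.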

A model called facebook/seamless-m4t-v2-large performs the speech-to-text inference. Model weights get downloaded when the service is first run.

The main loop in voice_assistant_service.main() performs the following tasks:

  • Get a sentence from the microphone. Convert it to text using the speech-to-text model.
  • Check if the user spoke the message defined in the <end_of_conversation_text> element of the configuration file. In this case, the conversation ends, and the program terminates after playing the goodbye message.
  • Check if the sentence is gibberish. The speech-to-text engine often outputs a valid English sentence, even when I didn’t say anything. Luckily, these undesirable outputs tend to repeat themselves. For example, gibberish sentences will often start with “[” or “i’m going to”. I collected a list of prefixes often associated with a gibberish sentence in the <gibberish_prefix_list> element of the configuration file (this list would likely change for another speech-to-text model). Whenever an audio input starts with one of the prefixes in the list, the sentence is ignored.
  • If the sentence doesn’t appear to be gibberish, send a request to the chat service. Play the response.
The main loop in voice_assistant_service.main(). Code by the author.
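Since the author's code appears only as an image, here is a simplified sketch of the loop logic described in the list above. It is not the project's implementation: the function names are hypothetical, the microphone, chat service, and speaker are passed in as plain callables, and treating an empty transcription as gibberish is an assumption added for the example.

```python
from typing import Callable, Iterable


def normalize(sentence: str) -> str:
    """Lower-case the sentence and drop surrounding spaces and final punctuation."""
    return sentence.strip().lower().rstrip(".!?")


def is_gibberish(sentence: str, prefixes: Iterable[str]) -> bool:
    """Return True when the sentence is empty or starts with a known gibberish prefix."""
    lowered = sentence.strip().lower()
    if not lowered:
        return True  # assumption: an empty transcription is treated as noise
    return any(lowered.startswith(prefix.lower()) for prefix in prefixes)


def run_conversation(
    listen: Callable[[], str],   # returns the next transcribed sentence
    ask: Callable[[str], str],   # sends a prompt to the chat service
    speak: Callable[[str], None],  # plays a sentence through text-to-speech
    end_text: str,
    gibberish_prefixes: Iterable[str],
    goodbye: str = "Goodbye",
) -> None:
    """Listen, filter, query the chat service, and speak until the end phrase is heard."""
    prefixes = list(gibberish_prefixes)
    while True:
        sentence = listen()
        if normalize(sentence) == normalize(end_text):
            speak(goodbye)
            return
        if is_gibberish(sentence, prefixes):
            continue  # ignore spurious transcriptions
        speak(ask(sentence))
```

Passing the three services as callables keeps the loop independent of any particular speech or chat library, which also makes it straightforward to try out with canned inputs.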

Wake-word service

The last component is a service that continually listens to the user’s microphone. When the user speaks the wake-word, a system call starts the voice assistant service. The wake-word service runs a smaller model than the voice assistant service models. For this reason, it makes sense to have the wake-word service running continuously while the voice assistant service only launches when we need it.

You can find the wake-word service code here.

After cloning the project, move to …/wakeword_service/server.

Rename wakeword_service_gui_config.xml.example to wakeword_service_gui_config.xml.

Rename command.bat.example to command.bat. You’ll need to edit command.bat so the virtual environment activation and the call to the voice assistant service correspond to your directory structure.

You can start the service by the following call:


The core of the wake-word detection service is the openwakeword project. Out of a few wake-word models, I picked the “hey jarvis” model. I found that simply saying “Jarvis?” will trigger the detection.

Whenever the wake-word is detected, a command file gets called, as specified in the <command_on_wakeword> element of the configuration file. In our case, the command.bat file activates the virtual environment and starts the voice assistant service.

The configuration file of the wake-word detection service GUI. Image by the author.
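The trigger mechanism can be pictured with the following sketch. It assumes the detector yields, for each audio frame, a dictionary mapping a wake-word model name to a score between 0 and 1; that shape, the 0.5 threshold, and all function names are assumptions for illustration rather than the project's code. The beeps mirror the two-beep signals emitted before and after the voice assistant runs.

```python
import subprocess
from typing import Callable, Dict, Iterable


def wakeword_detected(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Return True when any wake-word model score reaches the threshold."""
    return any(score >= threshold for score in scores.values())


def run_command(command_path: str) -> int:
    """Run the command file, wait for it to finish, and return its exit code."""
    return subprocess.run([command_path], check=False).returncode


def watch(
    score_stream: Iterable[Dict[str, float]],
    command_path: str,
    threshold: float = 0.5,
    runner: Callable[[str], object] = run_command,
    beep: Callable[[], None] = lambda: None,
) -> int:
    """Call the command file each time the wake-word is detected; return the trigger count."""
    triggers = 0
    for scores in score_stream:
        if wakeword_detected(scores, threshold):
            beep()                 # signal that the wake-word was heard
            runner(command_path)   # blocks while the launched program runs
            beep()                 # signal that the launched program has ended
            triggers += 1
    return triggers
```

Because the runner blocks until the command file returns, the detector does not react to the wake-word while a conversation is in progress.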

If you want to use the wake-word detection service for something else, you can edit the command.bat file to make it start whatever program you want.

When the wake-word is detected, the wake-word detection service emits a two-beep audio signal to indicate it was triggered. This quick feedback tells the user that they must wait as the text-to-speech and the speech-to-text models load in the computer’s GPU, which can take several seconds. When both models are loaded, the voice assistant service plays the greeting, and the conversation can begin. After the conversation ends, the voice assistant service terminates, and the wake-word detection service emits another two-beep signal.

When the three components are installed and tested, we can start the chat and the wake-word detection services. The wake-word detection service will constantly listen to the microphone. When the wake-word gets detected (“Jarvis?”), it will start the voice assistant service. You can then ask questions out loud and receive a spoken answer. When the end-of-conversation phrase (“Thank you and goodbye”) is detected, the voice assistant service ends.

Here is an example of an interaction with the voice assistant:

“Jarvis?” [beep-beep]

[Several seconds of silence as the voice assistant models load]

“Hello! How can I help you?”

“What are Saturn’s rings made of?”

“Saturn’s rings are primarily composed of ice particles, with the majority being made up of water ice. However, there are also smaller amounts of other ices, such as ammonia ice and rocky particles, in the rings. The exact composition of the rings varies depending on the distance from Saturn, with the inner rings being denser and made up of smaller particles, while the outer rings are more spread out and contain larger ice particles.”

“Tell me the plot of the movie ‘Back to the Future’.”

“The movie ‘Back to the Future’ follows the story of Marty McFly, a high school student who is accidentally transported back to 1955 in a time-traveling DeLorean car created by his eccentric scientist friend, Doc Brown. Marty must navigate the challenges of the past while trying to repair the damage to the present and prevent his teenage mother from falling in love with his teenage father. Along the way, Marty befriends a younger version of Doc Brown and helps him overcome a personal tragedy. The movie explores themes of fate, family, and the consequences of altering the past.”

“Thank you and goodbye”

“Goodbye” [beep-beep]

In this project, we implemented a voice assistant system composed of a wake-word detection service, a voice assistant service, and a chat service. Each one of these components is a standalone building block reusable for other projects.

Among the many surprises I had while working on this voice assistant, what struck me the most was the quality of the speech-to-text conversion. If you’re like me, you have probably struggled with automated voice recognition systems that fail to transcribe simple commands such as “Turn down the volume”! I expected speech-to-text conversion to be the main stumbling block of the pipeline. After experimenting with a few unsatisfying models, I landed on facebook/seamless-m4t-v2-large and was impressed with the quality of the results. I can even speak a sentence in French, and the neural network will automatically translate it into English. Nothing less than amazing!

I hope you’ll try this fun project, and let me know what you use it for!
