Speech to Text to Speech with AI Using Python — a How-To Guide | by Naomi Kriger | Feb, 2024



How to Create a Speech-to-Text-to-Speech Program

Naomi Kriger

Towards Data Science
Photo by Mariia Shalabaieva on Unsplash

It’s been exactly a decade since I started attending GeekCon (yes, a geeks’ conference 🙂) — a weekend-long hackathon-makeathon in which all projects must be useless and just-for-fun, and this year there was an exciting twist: all projects were required to incorporate some form of AI.

My team’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk to, and then verbally says anything they’d like to that character. The spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.

Now that the game is up and running, bringing laughs and fun, I’ve written this how-to guide to help you create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.

Want to see the full code? Here is the link!

Once the server is running, the user will hear the app “talking”, prompting them to choose the figure they want to talk to and start conversing with their chosen character. Each time they want to talk out loud — they should press and hold a key on the keyboard while speaking. When they finish talking (and release the key), their recording will be transcribed by Whisper (a speech-to-text model by OpenAI), and the transcription will be sent to ChatGPT for a response. The response will be read out loud using a text-to-speech library, and the user will hear it.

Disclaimer

Note: The project was developed on a Windows operating system and uses the pyttsx3 library, which lacks compatibility with M1/M2 chips. As pyttsx3 is not supported on Mac, users are advised to explore alternative text-to-speech libraries that are compatible with macOS environments.

OpenAI Integration

I used two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API for generating responses based on the user’s input to their chosen figure. While doing so costs money, the pricing model is very cheap, and personally, my bill is still under $1 for all my usage. To get started, I made an initial deposit of $5; so far I haven’t exhausted this credit, and the deposit won’t expire until a year from now.
I’m not receiving any payment or benefits from OpenAI for writing this.

Once you get your OpenAI API key — set it as an environment variable and use it when making the API calls. Make sure not to push your key to the codebase or any public location, and not to share it unsafely.
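As a minimal sketch of that setup (the helper name and error message are mine, not from the project), reading the key from the environment and failing fast when it’s missing could look like this:

```python
import os


def load_openai_api_key() -> str:
    # Read the key from the environment instead of hard-coding it,
    # so it never ends up in the codebase or version control.
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it before running the app, "
            "e.g. `export OPENAI_API_KEY=sk-...` (or via `setx` on Windows).")
    return api_key
```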

Speech to Text — Create Transcription

The speech-to-text feature was implemented using Whisper, an OpenAI model.

Below is the code snippet for the function responsible for transcription:

import asyncio
import os
from threading import Thread
from typing import Optional

import openai


async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)

    # Print progress text in a separate thread so the user sees activity
    # while the transcription request is in flight.
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript

This function is marked as asynchronous (async) since the API call may take a while to return a response, and we await it to ensure that the program doesn’t progress until the response is received.

As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is printed gradually while the user awaits the next step.
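That helper isn’t shown in the snippets here, but a simplified sketch of the idea (the real implementation in the repo may differ) is to print the waiting message one character at a time while the transcription task runs:

```python
import sys
import time


def print_text_while_waiting_for_transcription(
        text: str, delay_seconds: float = 0.05) -> None:
    # Emit the message character by character so the user can see that
    # the program is still working while the API call is pending.
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay_seconds)
    sys.stdout.write("\n")
```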

String Matching Using FuzzyWuzzy for Text Comparison

After transcribing the speech into text, we either used it as is, or tried to compare it with an existing string.

The comparison use cases were: selecting a figure from a predefined list of options, deciding whether to continue playing or not, and, when opting to continue, deciding whether to choose a new figure or stick with the current one.

In such cases, we wanted to compare the user’s spoken input transcription with the options in our lists, and therefore we decided to use the FuzzyWuzzy library for string matching.

This enabled choosing the closest option from the list, as long as the matching score exceeded a predefined threshold.

Here’s a snippet of our function:

from typing import List

from fuzzywuzzy import fuzz


def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    if best_match_score >= 70:
        return best_match
    else:
        return ""

If you want to learn more about the FuzzyWuzzy library and its functions — you can check out an article I wrote about it here.
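If you’d rather avoid the extra dependency, the standard library’s difflib can approximate the same threshold-based matching. This is a rough stand-in rather than what we used — FuzzyWuzzy’s token_set_ratio is more forgiving about word order and extra words than character-level similarity:

```python
from difflib import SequenceMatcher
from typing import List


def detect_option_with_difflib(transcript: str, options: List[str],
                               threshold: float = 0.7) -> str:
    # Score each option by character-level similarity and keep the best,
    # returning "" when no option clears the threshold.
    best_score, best_match = 0.0, ""
    for option in options:
        score = SequenceMatcher(
            None, transcript.lower(), option.lower()).ratio()
        if score > best_score:
            best_score, best_match = score, option
    return best_match if best_score >= threshold else ""
```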

Get ChatGPT Response

Once we have the transcription, we can send it over to ChatGPT to get a response.

For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.

So our function looked as follows:

import logging


def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e

and the system instructions looked as follows:

def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"
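The make_openai_request helper isn’t shown in the snippets above. A plausible sketch of it (the names and model choice here are my assumption, using the same pre-v1 openai-python interface as the openai.Audio.transcribe call earlier) assembles the system and user messages and hands them to the chat API:

```python
from typing import Dict, List


def build_chat_messages(system_instructions: str,
                        user_question: str) -> List[Dict[str, str]]:
    # ChatGPT requests take a list of role-tagged messages: the system
    # message sets the persona, the user message carries the transcript.
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_question},
    ]


def make_openai_request(system_instructions: str, user_question: str):
    import openai  # deferred so the payload helper above stays importable
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_chat_messages(system_instructions, user_question),
    )
```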

Text to Speech

For the text-to-speech part, we opted for a Python library called pyttsx3. This choice was not only straightforward to implement but also offered several additional advantages. It’s free of charge, provides two voice options — female and male — and lets you select the speaking rate in words per minute (speech speed).

When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the chosen gender.

This is what our text-to-speech function looked like:

def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()

    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)

    engine.say(text)
    engine.runAndWait()

The Main Flow

Now that we’ve roughly got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Eventually, most of these higher-level functions tie into the internal functions we’ve covered above.

Here’s a snippet of the main game flow:

import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import (choose_figure, start, play_round,
                                   is_another_round)


def farewell() -> None:
    farewell_message = ("It was great having you here, "
                        "hope to see you again soon!")
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)


async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}


async def main():
    start()
    another_round = True
    figure = ""

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = (user_choices.get("figure"),
                                     user_choices.get("another_round"))
            if not figure:
                break

        if another_round is False:
            farewell()
            break


if __name__ == "__main__":
    asyncio.run(main())

We had several ideas in mind that we didn’t get to implement during the hackathon, either because we didn’t find an API we were satisfied with during that weekend, or because time constraints prevented us from developing certain features. These are the paths we didn’t take for this project:

Matching the Response Voice with the Chosen Figure’s “Actual” Voice

Imagine if the user chose to talk to Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we couldn’t find a library or API during the hackathon that offered this feature at a reasonable cost. We’re still open to suggestions if you have any =)

Let the Users Talk to “Themselves”

Another intriguing idea was to prompt users to provide a vocal sample of themselves speaking. We’d then train a model using this sample and have all the responses generated by ChatGPT read aloud in the user’s own voice. In this scenario, the user could choose the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble the user’s. However, we couldn’t find an API that supported this within the constraints of the hackathon.

Adding a Frontend to Our Application

Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize the backend development. As a result, the application currently runs on the command line interface (CLI) and doesn’t have a frontend side.

Latency is what bothers me most at the moment.

There are a few components in the flow with relatively high latency that, in my opinion, slightly hurt the user experience. For example: the time from finishing providing the audio input until receiving a transcription, and the time from when the user presses a button until the system actually starts recording the audio. So if the user starts talking right after pressing the key — at least one second of audio won’t be recorded due to this lag.
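A first step toward tackling this is measuring where the time actually goes. One simple way (a sketch of my own, not code from the project) is a small context manager that records per-stage durations:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage_name: str, timings: dict):
    # Record how long a stage of the pipeline takes, so the slow steps
    # (transcription, recording start-up lag) become visible.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage_name] = time.perf_counter() - start
```

Wrapping each stage, e.g. `with timed("transcription", timings): ...`, then printing the collected timings shows which stage dominates the delay.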

Want to see the whole project? It’s right here!

Also, warm credit goes to Lior Yardeni, my hackathon partner with whom I created this game.

In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a modest cost, they’re budget-friendly.

We hope you found this guide enlightening and that it motivates you to embark on your own projects.

Cheers to coding and enjoyable! 🚀


