I want to develop an app with real-time conversational capabilities. Basically it would run as follows:
- The user would say something.
- The app would then (1) convert voice-to-text (2) generating a reply via an LLM (3) convert the reply via text-to-speech to generate an output
- Repeat steps 1 & 2 in real-time throughout the conversation
I know all three of the requires pieces here (text-to-speech, speech-to-text, LLMs) are somewhat of a commodity these days, but I am wondering if there are any open source tools or projects which have glued these altogether?