[Georgi Gerganov] recently shared a great resource for running high-quality AI-driven speech recognition in a plain C/C++ implementation on a variety of platforms. The automatic speech recognition (ASR) model is fully implemented using only two source files and requires no dependencies. As a result, the high-quality speech recognition doesn’t involve calling remote APIs, and can run locally on different devices in a fairly straightforward manner. The image above shows it running locally on an iPhone 13, but it can do more than that.
[Georgi]’s work is a port of OpenAI’s Whisper model, a remarkably robust piece of software that does a truly impressive job of turning human speech into text. Whisper is easy to set up and play with, but this port makes it easier to get the system working in other ways. Having such a lightweight implementation of the model means it can be more easily integrated into a variety of different platforms and projects.
The usual way that OpenAI’s Whisper works is to feed it an audio file, and it spits out a transcription. But [Georgi] shows off something else that might start giving hackers ideas: a simple real-time audio input example.
By using a tool to stream audio and feed it to the system every half-second, one can obtain pretty good (sort of) real-time results! This of course isn’t an ideal method, but the robustness and accuracy of Whisper are such that the results look pretty great nonetheless.
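To make the half-second trick a little more concrete, here is a minimal Python sketch of the general idea. It is not [Georgi]’s code and does not use the port’s API. The names `half_second_steps` and `stream_transcribe`, the five-second rolling window, and the `transcribe` callback are all assumptions made up for illustration; the callback stands in for whatever speech-to-text call you actually have. The only detail taken from Whisper itself is the 16 kHz sample rate its models work with.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def half_second_steps(samples: Iterable[float],
                      sample_rate: int = SAMPLE_RATE,
                      step_s: float = 0.5) -> Iterator[List[float]]:
    """Yield consecutive blocks of `step_s` seconds from a sample stream."""
    step = int(sample_rate * step_s)
    block: List[float] = []
    for sample in samples:
        block.append(sample)
        if len(block) == step:
            yield block
            block = []
    if block:  # trailing partial block at end of stream
        yield block


def stream_transcribe(samples: Iterable[float],
                      transcribe: Callable[[List[float]], object],
                      sample_rate: int = SAMPLE_RATE,
                      step_s: float = 0.5,
                      window_s: float = 5.0) -> Iterator[object]:
    """Every `step_s` seconds, re-run `transcribe` on the most recent audio.

    `transcribe` is a hypothetical stand-in for a real speech-to-text call.
    `window_s` (how much recent audio is kept) is an assumed value.
    """
    window: deque = deque(maxlen=int(sample_rate * window_s))
    for block in half_second_steps(samples, sample_rate, step_s):
        window.extend(block)          # oldest samples fall off the front
        yield transcribe(list(window))


if __name__ == "__main__":
    # Stand-in "transcriber" that only reports how much audio it was given.
    def fake_transcribe(audio: List[float]) -> str:
        return f"{len(audio) / SAMPLE_RATE:.1f} s of audio"

    silence = [0.0] * (SAMPLE_RATE * 2)  # two seconds of silence
    for partial in stream_transcribe(silence, fake_transcribe):
        print(partial)
```

Re-transcribing a rolling window on every step is wasteful, and words can get cut at the window edges, which is part of why this kind of approach is "sort of" real-time rather than a proper streaming decoder.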
You can watch a quick demo of that in the video just below the page break. If it gives you some ideas, head over to the project’s GitHub repository and get hackin’!