Author: Luis Damián Moreno García
Parakeet 2, officially known as Parakeet-TDT-0.6b-V2, was recently released.
It is currently at the top of Hugging Face’s Open ASR Leaderboard (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), with a Word Error Rate (WER) of only 6.05%. In other words, roughly 6 errors per 100 words, which can help researchers transcribe interviews and focus groups more accurately and spend less time on post-editing.
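To make the WER figure more concrete, here is a minimal Python sketch of how a word error rate can be computed: the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The function name and the two example sentences are made up for illustration, and leaderboard evaluations typically normalise the text more thoroughly than the simple lowercasing used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one missing word ("on") and one wrong word ("mourning").
reference = "the interview took place on monday morning"
hypothesis = "the interview took place monday mourning"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Here two errors against a seven-word reference give a WER of about 28.57%, so a score of 6.05% is a much cleaner transcript than this toy example.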
It can potentially transcribe an hour of audio in about a second, which makes it useful for large volumes of content. For example, it could support PhD-level research, such as thematic analysis of a large corpus of subtitles. It could also be used to provide live transcriptions, whose reception could then be investigated.
It is also fully open-source under the CC-BY-4.0 license. This means it can be used for both commercial and non-commercial purposes, provided attribution is given, and it can be adapted to better understand specific accents, dialects, etc.
Unfortunately, at present it is limited to English speech recognition…
You can try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2
You can upload your own audio files or use your microphone to record your voice. Then click the green Transcribe Uploaded File button (see image below). After a very short pause, the transcript appears, split into segments with start and end times. You can then download the transcript in CSV format by clicking the grey button.

A nice feature is that you can click on any segment in the transcript that has been generated to play it.
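Once downloaded, the CSV transcript can be processed with a few lines of Python, for instance to find every segment that mentions a keyword. The sketch below is only an illustration: the column names ("Start (s)", "End (s)", "Segment") and the sample rows are assumptions, so check the header row of your own file and adjust accordingly.

```python
import csv
import io

# Hypothetical sample data. The column names are an assumption;
# compare them with the header row of the CSV you downloaded.
SAMPLE_CSV = """Start (s),End (s),Segment
0.00,2.40,Thank you for joining the focus group.
2.40,6.10,"Let's start with your experience of subtitles, if that's okay."
6.10,9.85,I mostly watch subtitles on my phone.
"""


def load_segments(csv_file):
    """Read transcript rows into dictionaries with numeric timestamps."""
    reader = csv.DictReader(csv_file)
    return [
        {
            "start": float(row["Start (s)"]),
            "end": float(row["End (s)"]),
            "text": row["Segment"].strip(),
        }
        for row in reader
    ]


def find_segments(segments, keyword):
    """Return the segments whose text contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [seg for seg in segments if keyword in seg["text"].lower()]


segments = load_segments(io.StringIO(SAMPLE_CSV))
for seg in find_segments(segments, "subtitles"):
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
```

To use a real file instead of the sample, open it with `open("transcript.csv", newline="", encoding="utf-8")` (the file name is a placeholder) and pass the file object to `load_segments`.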

At present, it can handle clips up to 3 hours long (and even longer audio if you use this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py).
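The general idea behind handling very long recordings is to cut the audio into shorter, slightly overlapping windows and transcribe them one by one. The sketch below only computes such windows. It is not how the NeMo script above is implemented, and the function name, the 10-minute chunk length and the 5-second overlap are arbitrary values chosen for illustration.

```python
def chunk_windows(total_seconds, chunk_seconds=600.0, overlap_seconds=5.0):
    """Return (start, end) times, in seconds, of overlapping windows covering the audio."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    if total_seconds <= 0:
        return []

    windows = []
    step = chunk_seconds - overlap_seconds  # how far each new window advances
    start = 0.0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return windows


# Hypothetical example: a 4-hour recording split into 10-minute windows.
windows = chunk_windows(4 * 3600)
print(f"{len(windows)} windows, first: {windows[0]}, last: {windows[-1]}")
```

The overlap gives each window a little context from its neighbour, so that words falling on a boundary are less likely to be cut in half. The overlapping text then has to be merged when the partial transcripts are joined.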
Have fun!