wav2vec2-large-xlsr-53-russian is an automatic speech recognition (ASR) model for Russian, fine-tuned from Facebook’s wav2vec2-large-xlsr-53 on the Russian portions of Mozilla’s Common Voice 6.1 and CSS10 datasets. Input audio should be sampled at 16 kHz. The model transcribes Russian speech directly, without a language model, and reaches a Word Error Rate (WER) of 13.3% and a Character Error Rate (CER) of 2.88% on the Common Voice test set; adding a language model lowers these to 9.57% WER and 2.24% CER. It supports both PyTorch and JAX and works with the Hugging Face Transformers and HuggingSound libraries. It suits Russian transcription tasks in research, accessibility, and voice interface development. Training was made possible by compute support from OVHcloud, and the training scripts are publicly available for replication.
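As a rough illustration of inference without a language model, the sketch below loads the model through Hugging Face Transformers, resamples audio to 16 kHz, and decodes greedily. The Hub identifier in `MODEL_ID`, the use of `librosa` for loading audio, and the file path in the usage section are assumptions, not details given in the description above.

```python
# Minimal inference sketch, assuming the packages torch, transformers and
# librosa are installed. MODEL_ID is an assumed Hugging Face Hub identifier.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(audio_paths):
    """Return one greedy (no language model) transcription per audio file."""
    # Imported here so the module can be loaded without the heavy dependencies.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa.load resamples to the requested rate and downmixes to mono.
    speech = [librosa.load(path, sr=SAMPLE_RATE)[0] for path in audio_paths]

    inputs = processor(
        speech, sampling_rate=SAMPLE_RATE, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(
            inputs.input_values, attention_mask=inputs.attention_mask
        ).logits

    # Greedy CTC decoding: pick the most likely token for each frame, then let
    # the processor collapse repeated tokens and drop blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)


if __name__ == "__main__":
    # "sample_ru.wav" is a hypothetical file name used only for illustration.
    for text in transcribe(["sample_ru.wav"]):
        print(text)
```

The first call downloads the model weights, so the initial run takes noticeably longer than later ones.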
Features
- Fine-tuned on Common Voice 6.1 and CSS10 Russian datasets
- Based on Facebook’s wav2vec2-large-xlsr-53 pretrained model
- Expects input audio sampled at 16 kHz
- Achieves 13.3% WER and 2.88% CER (without language model)
- Improved WER (9.57%) and CER (2.24%) with language model integration
- Usable with HuggingSound or custom inference pipelines
- Available under Apache-2.0 license for commercial and research use
- Compatible with PyTorch, JAX, and the Hugging Face Transformers ecosystem
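For readers unfamiliar with the metrics quoted above, the following is a minimal pure-Python sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length in words or characters. It is not the evaluation script used for this model; text normalization (casing, punctuation) is omitted, and counting spaces as characters in CER is an assumption. The helper `relative_reduction` is a hypothetical name added for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Fraction by which an error rate drops, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    ref = "привет как дела"
    hyp = "привет так дела"  # one wrong word, which is also one wrong character
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
    # Reported WER without and with a language model.
    print(f"Relative WER reduction: {relative_reduction(13.3, 9.57):.1%}")
```

Applied to the figures in the list, the language model cuts WER by roughly 28% and CER by roughly 22% in relative terms.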