
This article was automatically translated from the original Turkish version.


Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is a fundamental technology that converts spoken language into written text, enabling audio data to be processed and analyzed digitally and facilitating human-machine interaction. Recent advances in Large Language Models (LLMs) have led to significant improvements in the capabilities of STT systems.


Development of Speech-to-Text Technology

ASR technology has a long history, with traditional approaches relying on statistical methods such as Hidden Markov Models (HMMs). The rise of deep learning and sequence-to-sequence architectures such as the Transformer has enabled End-to-End (E2E) systems that map audio input directly to text output, simplifying the pipeline and often improving accuracy. A major advance has been the popularization of self-supervised learning (SSL) through models such as wav2vec 2.0: by learning fundamental properties of speech from large amounts of unlabeled audio, pre-trained models can reach high accuracy after fine-tuning on only small amounts of labeled data.


Prominent Open-Source LLM-Based STT Systems

The integration of LLMs into speech recognition has produced numerous powerful open-source systems, the best known being Whisper, developed by OpenAI. Trained on large-scale, diverse, weakly supervised data, Whisper is highly robust across languages, accents, and noisy conditions. Its multilingual architecture, multiple size variants, and open-source license have driven widespread adoption, and projects such as whisper.cpp enable efficient deployment on a variety of platforms. Meta AI projects such as Seamless Communication unify multiple speech tasks within a single model, while the MMS project focuses on low-resource languages. Toolkits such as NVIDIA NeMo provide resources for training and deploying models built on advanced architectures like the Conformer; these models are typically trained on combinations of diverse datasets.
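Part of what makes deployments like whisper.cpp efficient is weight quantization, which shrinks model storage and memory traffic. A minimal sketch of symmetric int8 quantization, assuming a random matrix purely for illustration (real runtimes use more elaborate block-wise schemes):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # illustrative weights
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4  (int8 storage is 4x smaller than float32)
```

The round-trip error of this scheme is bounded by half a quantization step (`scale / 2`), which is typically small enough that transcription accuracy degrades only slightly.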


Turkish Speech-to-Text Research

Turkish’s agglutinative structure and rich morphology pose distinctive challenges for STT systems, such as out-of-vocabulary words and segmentation difficulties. The limited availability of publicly available labeled data compared to English has also directed research toward adapting pre-trained multilingual models to Turkish. Whisper’s multilingual capabilities have made it a popular base model: community fine-tuned versions, such as sgangireddy/whisper-medium-tr, are shared on platforms like Hugging Face and trained on datasets such as Common Voice, and they generally outperform the base model on Turkish. Comparative analyses have shown that fine-tuned Whisper models can also outperform earlier SSL models such as XLS-R, particularly on Turkish speech from diverse domains.
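Such comparisons are usually reported as word error rate (WER): the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch; the Turkish example sentence is hypothetical:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("cok" for "çok") out of four reference words:
print(wer("bugün hava çok güzel", "bugün hava cok güzel"))  # 0.25
```

Note that WER penalizes Turkish models harshly for morphological near-misses, since a single wrong suffix counts as a full word substitution; this is one reason character error rate (CER) is often reported alongside it.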


Training Methods and Data Sources

The dominant approach for modern LLM-based STT systems is the pre-training and fine-tuning paradigm: models first learn general speech patterns from very large amounts of (often unlabeled) audio during pre-training, and are then optimized on smaller labeled datasets specific to the target language or task during fine-tuning. Crowdsourced, permissively licensed datasets such as Common Voice play a crucial role in fine-tuning and evaluating Turkish STT models. The massive datasets used for pre-training large models, however, are typically compiled from web sources and are not always publicly accessible.


Challenges in Fine-Tuning Large Models

Adapting large pre-trained models to specific tasks introduces several technical challenges: collecting enough high-quality, task-specific labeled data; catastrophic forgetting, where the model loses some of the general capabilities acquired during pre-training; hallucination, where the model generates text not present in the audio; significant computational cost (GPU usage and time); and mismatches between the training data distribution and the application domain. To address these challenges, practitioners employ strategies such as parameter-efficient fine-tuning (PEFT), data augmentation, and architectural improvements.
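PEFT methods such as LoRA illustrate how both the cost and the forgetting problems are attacked at once: the pre-trained weight matrix W stays frozen, and only a low-rank update B·A is trained. A minimal numpy sketch with illustrative dimensions (real implementations apply this per attention projection):

```python
import numpy as np

d, r = 1024, 8                           # hidden size, LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

x = rng.standard_normal(d)
# Adapted forward pass: base output plus a low-rank correction.
y = W @ x + B @ (A @ x)

trainable = A.size + B.size
print(trainable / W.size)  # fraction of parameters vs. full fine-tuning
```

Because B starts at zero, the adapted model is exactly the pre-trained model at initialization, and here only about 1.6% of the matrix's parameters are ever updated, which is what makes fine-tuning large STT models feasible on modest hardware.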


Application Areas

LLM-based STT technology is currently used across a wide range of applications including transcription of meetings, lectures, interviews, and media content; voice command systems and virtual assistants; analysis of call center conversations; dictation-based text input (particularly in medicine and law); automatic subtitle generation for media; pronunciation assessment in language learning platforms; and accessibility technologies—for example, for the hearing impaired. The availability of open-source models enables organizations and developers to create custom, data-privacy-preserving STT solutions tailored to their specific needs.

Author Information

Abdullah Aydoğan, December 5, 2025, 2:18 PM


