BorisovAI

ScribeAir: Offline Speech-to-Text for Windows (v2.0.12)

Voice input and speech recognition on Windows without the cloud. 3.3% WER on Russian, speaker diarization for meeting transcripts, five modes for different recording profiles. Free, open source.

  • Speech-to-Text: Local speech recognition without the cloud.
  • Voice Input: Text appears at the cursor in any program.
  • Offline: All processing on your device. No internet needed after the first-launch model download.
  • Open Source: MIT license. Full source code available.
  • Free: Free for personal and commercial use.
  • GigaAM: Russian acoustic model from Sberbank. 700,000 hours of training data.
  • Whisper: OpenAI Whisper large-v3-turbo, multilingual.
  • Cascade AO: Russian-tuned Whisper runs first; GigaAM steps in on low-confidence segments.
  • Speaker Diarization: Labels who is speaking in the transcript, up to four voices.
  • Meeting Transcription: Recordings of calls and interviews turn into a ready-to-edit document.
  • Wake Word: Say "Запись" ("record") to start, "Стоп" ("stop") to finish.
  • Push-to-Talk: Hold a hotkey, speak, release. Text appears instantly.
  • Bilingual: Russian and English, including mixed segments in one recording.
  • RU+EN: Per-segment language detection during recording.
  • ONNX: Optimized runtime. Fast on CPU and GPU.
  • GPU Accelerated: CUDA support for faster transcription.
  • 3.3% WER: Word error rate on Russian audiobook benchmark.
  • Windows: Windows 10 and 11, x64.
Utilities
Python 3.12, GigaAM v3, Whisper large-v3-turbo, ONNX Runtime, CTranslate2, faster-whisper, Silero VAD, pyannote-segmentation-3.0, WeSpeaker, openWakeWord, pymorphy3, tkinter, PyInstaller

Documentation

ScribeAir turns speech into text at the cursor in any program. Hold a hotkey, say the phrase, and it appears where you were typing: an editor, a chat, an email, source code. Audio is processed locally and never leaves your computer.

The app helps anyone who dictates faster than they type: on calls, when transcribing interviews, when writing documentation. The models are tuned for Russian, where they outperform the usual alternatives. The main one is GigaAM by Sberbank, trained on 700,000 hours of Russian speech. On clean recordings (the Russian audiobook benchmark) it scores a 3.3% word error rate, better than Google Speech-to-Text (around 10%) and Dragon (around 8%).

How you use it

After launch a microphone icon appears in the system tray. Once models finish loading, the icon goes dim and recording becomes available. Two ways to record:

  • Hold the configured hotkey (Win+Ctrl by default), say your phrase, release — text appears at the cursor.
  • Turn on wake-word activation and say "запись" ("record"). To stop, say "стоп" ("stop"). Hands-free, no keys involved.

While recording, a small overlay floats above other windows: you see what is being recognised right now and whether the microphone is picking anything up at all.

Five recognition modes

You switch between them in a couple of seconds from the tray menu — no reinstall. Each mode is a concrete set of models matched to a concrete recording profile.

Auto. On a machine with an NVIDIA GPU, audio goes to Whisper; on CPU-only machines, GigaAM for Russian and Whisper for English or mixed. Sensible default.

Hybrid. Both models are loaded at once. Each speech segment is routed by its language: Russian to GigaAM, English to Whisper. Useful for conversations that switch languages and for presentations with English terms in a Russian context.
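The per-segment routing in Hybrid mode can be pictured with a minimal sketch. It assumes a language code has already been detected for each segment; the `Segment` class, the `route_segment` function and the recognizer callables are hypothetical names for illustration, not ScribeAir's actual code.

```python
from dataclasses import dataclass
from typing import Callable

# A recognizer is modelled as any callable that turns audio bytes into text.
Recognizer = Callable[[bytes], str]


@dataclass
class Segment:
    audio: bytes    # raw audio for one speech segment
    language: str   # detected language code, e.g. "ru" or "en" (assumed given)


def route_segment(segment: Segment, gigaam: Recognizer, whisper: Recognizer) -> str:
    """Send Russian segments to GigaAM and every other segment to Whisper."""
    if segment.language == "ru":
        return gigaam(segment.audio)
    return whisper(segment.audio)
```

Passing the recognizers in as plain callables keeps the routing rule separate from model loading, so the rule itself can be checked with stand-in functions.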

Cascade AO. The Russian-tuned Whisper takes the first pass. If Whisper reports low confidence in its own result (the segment log-probability drops below -0.20), the output is replaced with GigaAM's. This mode helps on variable-quality recordings: meetings with remote participants, dictation into a quiet microphone, noisy rooms. It also clears Whisper's signature hallucinations that slip in on short or unclear fragments: false subtitle credits ("Корректор А. Семкин", "Субтитры создавал DimaTorzok"), repeating "Thank you. Thank you." loops, stray runs of Icelandic text.
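The fallback rule can be written down in a few lines. This is a sketch under stated assumptions: the -0.20 threshold comes from the description above, while the `Hypothesis` class, its `avg_logprob` field name and the `cascade` function are illustrative and not taken from the project's source.

```python
from dataclasses import dataclass
from typing import Callable

LOGPROB_THRESHOLD = -0.20  # threshold quoted in the mode description


@dataclass
class Hypothesis:
    text: str
    avg_logprob: float  # segment log-probability; field name is an assumption


def cascade(whisper_result: Hypothesis, run_gigaam: Callable[[], str]) -> str:
    """Keep Whisper's text unless its confidence drops below the threshold."""
    if whisper_result.avg_logprob < LOGPROB_THRESHOLD:
        # Low confidence: discard Whisper's output and use GigaAM's instead.
        return run_gigaam()
    return whisper_result.text
```

GigaAM is passed as a zero-argument callable so that, in this sketch, the second model only runs on segments that actually need the fallback.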

Whisper. Whisper only. Pick it when you need a universal multilingual model and a GPU is available.

GigaAM. GigaAM only. Lowest latency (0.66s on CPU), punctuation built into the model, returns an empty string on silence rather than a made-up phrase. The one catch: it handles Russian only.

Meeting and interview transcription

Speaker diarization is built in. pyannote-segmentation-3.0 finds turn boundaries; the WeSpeaker network produces a voice embedding for each speaker; voices cluster into stable identifiers. The output text contains labels like [Speaker 1]:, [Speaker 2]:, up to four speakers.

When you stop recording, the Stage N pass kicks in: pyannote re-segments the whole recording with full context, and each turn is then transcribed end-to-end. Punctuation becomes meaningful — the model doesn't have to guess where a phrase, cut mid-word, was supposed to end. With Cascade AO selected, you get meeting minutes ready to paste into a document.

The network needs at least 1.5 seconds of audio to build a stable voice embedding, so very short replies (a one-word "yes") may attach to the previous speaker. The final Stage N pass fixes most such cases.
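How embeddings turn into speaker labels, and why a short reply sticks to the previous speaker, can be illustrated with a simplified tracker. The 1.5-second minimum and the four-speaker cap come from the text; the greedy cosine-similarity matching, the `SIMILARITY_THRESHOLD` value and all names below are assumptions made for the sketch, not the app's clustering method.

```python
import math

MIN_EMBED_SECONDS = 1.5     # from the text: shorter turns give unstable embeddings
MAX_SPEAKERS = 4            # from the text: up to four speakers
SIMILARITY_THRESHOLD = 0.6  # hypothetical value chosen for illustration


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SpeakerTracker:
    def __init__(self):
        self.references = []  # one reference embedding per known speaker
        self.last = None      # index of the most recent speaker

    def assign(self, embedding, duration):
        """Return a 0-based speaker index for one turn."""
        # Too short for a stable embedding: attach to the previous speaker.
        if duration < MIN_EMBED_SECONDS and self.last is not None:
            return self.last
        best, best_sim = None, -1.0
        for index, reference in enumerate(self.references):
            sim = cosine(embedding, reference)
            if sim > best_sim:
                best, best_sim = index, sim
        # Open a new speaker if nothing matches well and the cap allows it.
        if best is None or (best_sim < SIMILARITY_THRESHOLD
                            and len(self.references) < MAX_SPEAKERS):
            self.references.append(list(embedding))
            best = len(self.references) - 1
        self.last = best
        return best


def label(index):
    """Format a speaker index the way the transcript shows it."""
    return f"[Speaker {index + 1}]:"
```

In this sketch the third turn below is too short, so it inherits the previous speaker even though its embedding resembles the first one; a later full-context pass is the place where such a turn could be reassigned.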

Who benefits

Developers and technical writers. The app substitutes Latin names where you speak Russian: «питон» becomes Python, «гугл» — Google, «питест» — pytest. The pymorphy3 library parses Russian cases and word forms, so «питоне», «питоном», «питонов» all map to Python as well. The default dictionary contains 81 terms and is easy to extend through the settings file.
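The substitution step can be sketched as a dictionary lookup on each word's base form. The three dictionary entries are the examples from the paragraph above; `replace_terms`, `toy_lemmatize` and the tiny lemma table are hypothetical stand-ins so the example runs without extra packages. In the app the base form comes from pymorphy3; a lemmatizer built on it would presumably look like `lambda w: pymorphy3.MorphAnalyzer().parse(w)[0].normal_form`, which is not exercised here.

```python
import re
from typing import Callable

# Spoken Russian form -> written Latin term (examples from the text).
TERMS = {"питон": "Python", "гугл": "Google", "питест": "pytest"}

# Toy stand-in for a morphological analyzer: maps inflected forms to lemmas.
TOY_LEMMAS = {"питоне": "питон", "питоном": "питон", "питонов": "питон"}


def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


def replace_terms(text: str, lemmatize: Callable[[str], str], terms=TERMS) -> str:
    """Replace each Cyrillic word whose lemma is in the dictionary."""
    def swap(match):
        word = match.group(0)
        # Look up the lowercase lemma; keep the original word if unknown.
        return terms.get(lemmatize(word.lower()), word)

    return re.sub(r"[А-Яа-яЁё]+", swap, text)
```

Matching on the lemma rather than the surface form is what lets one dictionary entry cover every case ending.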

Anyone recording calls. Pick Cascade AO, tick the speaker diarization checkbox, record with the usual hotkey. After stop you have speaker-labeled text ready for editing or for uploading into a meeting-minutes system.

Journalists and bloggers. Long-form dictation and interview transcription run in streaming mode. Progress is visible immediately, with no waiting for the recording to finish.

People with limited hand mobility. Wake-word activation replaces every hotkey and lets you drive the app by voice alone.

Quality and speed

Benchmarks on Russian audiobooks. Word error rate and time to process one utterance:

  • GigaAM v3-e2e-rnnt: 3.3% word error rate, 0.66s on a single CPU core, 0.40s on GPU.
  • Whisper large-v3-turbo on GPU: 7.9%, 0.44s.
  • Whisper large-v3 on GPU: 8.8%, 2.30s.
  • Whisper base on CPU (baseline): 32.6%, 0.42s.
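For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below shows the standard calculation; how the benchmark normalizes case and punctuation before scoring is not described here, so this version simply splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference gives 0.25, that is 25%; a 3.3% rate corresponds to roughly one error per 30 reference words.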

On a separate 1717-sample mixed-quality set — quiet sections, distant voices, noise — Cascade AO scores 9.4% errors versus 10.2% for Whisper and 10.9% for GigaAM alone. The combination is more accurate than either model in isolation precisely on difficult recordings.

For reference, Windows native voice typing produces about 25% errors on Russian, Google Speech-to-Text about 10%, Dragon about 8%. On the audiobook benchmark above, GigaAM on CPU is more accurate than any Whisper variant running on an RTX 4090 GPU.

Technology

The stack is conservative: proven libraries, no in-house experiments that drag in huge dependencies.

  • Speech recognition: faster-whisper on CTranslate2 for Whisper, onnx-asr for GigaAM. The same interface works on CPU and GPU.
  • Voice activity detection: Silero VAD in ONNX. Reacts in milliseconds, doesn't miss speech, doesn't hang on long pauses.
  • Speaker diarization: pyannote-segmentation-3.0 for turn boundaries, WeSpeaker (voxceleb-resnet34-LM) for voice embeddings. Both ONNX — no PyTorch or TensorFlow on the user side.
  • Wake-word activation: openWakeWord with a custom bi-LSTM model on "запись" and "стоп". Trained on 1000+ synthetic samples from five voices plus a real-microphone set.
  • Morphology: pymorphy3 parses Russian cases and word forms when replacing IT terms.
  • Interface: tkinter for the settings window and the floating overlay, pystray for the tray icon.
  • Packaging: PyInstaller with a side-by-side install layout. Updates land next to the old build without touching the running process.

System requirements

  • Windows 10 or Windows 11, x64.
  • 8 GB RAM is enough for the CPU build; 16 GB is more comfortable with the CUDA build.
  • Any microphone — USB or laptop-built-in.
  • Optional: an NVIDIA GPU with 4 GB of VRAM or more, CUDA 12.x with cuDNN 9.x. With such a GPU, recognition is faster than you can speak.

Privacy

Audio, transcripts and intermediate results stay on the computer. The app sends nothing on its own, except in two explicit cases:

  • Once an hour an update check goes to the project mirror to fetch a signed manifest. The check is easy to disable in Settings.
  • If you have opted in to crash reporting, a crash sends a log tail and system info. Audio is never part of the report. The receiving endpoint is configurable.

No behavioural analytics, no advertising identifiers, no telemetry. MIT-licensed, source code is fully open.

Updates and release channels

Updates are signed with an Ed25519 key and flow through two channels. Stable is on by default — only vetted builds. Beta is where new versions land first under suffixes like v2.0.10-beta1; stable users never see them. If a new version fails to start, the app rolls back to the previous one automatically.
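The channel rule (stable users never see beta builds) can be sketched as a filter over version tags. The tag shape follows the `v2.0.10-beta1` example above; the regular expression, the function names and the assumption that a final release sorts after its own betas are illustrative. Signature verification and rollback are left out of the sketch.

```python
import re

# Assumed tag shape: vMAJOR.MINOR.PATCH with an optional -betaN suffix.
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-beta(\d+))?$")

FINAL = float("inf")  # sort key for a release without a beta suffix


def parse(tag: str):
    """Turn a tag into a sortable tuple; final releases sort after their betas."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag}")
    major, minor, patch, beta = match.groups()
    return (int(major), int(minor), int(patch), FINAL if beta is None else int(beta))


def is_beta(tag: str) -> bool:
    return parse(tag)[3] != FINAL


def latest_for_channel(tags, channel: str):
    """Newest tag visible on the channel ('stable' or 'beta'), or None."""
    visible = [t for t in tags if channel == "beta" or not is_beta(t)]
    return max(visible, key=parse) if visible else None
```

With tags `v2.0.9`, `v2.0.10-beta1` and `v2.0.10-beta2`, the stable channel resolves to `v2.0.9` and the beta channel to `v2.0.10-beta2`.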

Installation

The GitFlic releases page hosts two ZIP builds: CPU (about 180 MB) and CUDA (about 2.7 GB). Download, unzip, run ScribeAir.exe. Models auto-download on first launch; everything works offline afterwards.

If HuggingFace is blocked by your ISP, the repository describes a manual install procedure using the project mirror.
