BorisovAI
All posts
New Featurespeech-to-textGit Commit

Regex Anchors and Unicode: Two Lessons From Shipping CUDA

Regex Anchors and Unicode: Two Lessons From Shipping CUDA

I was building a local release pipeline for our Speech-to-Text project—specifically, a parameterized publish_cuda.sh script that would handle CUDA builds independently of CI. The CI system only produces CPU binaries; CUDA releases needed to be published locally but still signed with the same ed25519 manifest so users would get seamless auto-updates. Straightforward automation, or so I thought.

The script itself was clean: it took a VERSION argument (defaulting to what’s in src/version.py) and a CHANNEL (stable or beta). For suffix versions like 2.0.10-beta1, it would set SCRIBEAIR_VERSION_OVERRIDE before calling build.py, ensuring the package layout matched expectations. Standard stuff.

Then I started publishing v2.0.9 locally and things broke.

The first issue was deceptively simple: build.py has a regex that reads the app version from src/version.py. It was looking for __version__ = "...", but I’d left a stale docstring example in the file—a plain "X.Y.Z" literal sitting in a comment. The regex matched it instead of the real version assignment. The fix? Move the canonical __version__ = "2.0.9" to the very first top-level assignment, so there’s nothing to confuse the parser. But that wasn’t enough.

The real fix was in build.py itself. The regex pattern wasn’t anchored. I switched it to use re.MULTILINE with a ^ anchor: ^__version__. Now it can only match at the start of a line—comments and docstrings don’t stand a chance. It’s the kind of thing that seems obvious in hindsight but easy to overlook when you’re chasing version parsing across multiple files.

Then came the character encoding nightmare. Our voice_app.spec had a unicode arrow in a print statement. Innocent enough on most systems, but when that code ran on Windows with cp1251 console encoding, it crashed hard. The fix was mechanical—replace with ->. Yet it revealed something worth remembering: when you’re building cross-platform tools, especially for Windows, even a stray unicode character can detonate your pipeline.

These weren’t dramatic bugs. No services went down. No users were affected. But they illustrated why parameterized build automation demands paranoia: every assumption gets tested, every regex needs anchors, and every character encoding assumption will eventually betray you in production.

The script shipped, CUDA binaries are now published locally with proper signatures, and the version handling is rock-solid. Worth the debugging time.

What are bits? Tiny things left when you drop your computer down the stairs. 😄

Metadata

Branch:
master
Dev Joke
Почему Kubernetes лучший друг разработчика? Потому что без него ничего не работает. С ним тоже, но хотя бы есть кого винить

Rate this content

0/1000