BorisovAI

Learning Success by Video: Modular Policy Training with Simulation Filtering

I recently dove into an interesting problem while working on the Trend Analysis project: how do you train an AI policy to succeed without getting lost in noisy simulation data? The answer turned out to be more nuanced than I expected.

The core challenge was modular policy learning with simulation filtering from human video. We weren’t trying to build a general-purpose robot controller—we were targeting something more specific: learning behavioral patterns from real human demonstrations, then filtering out the synthetic data that didn’t match those patterns well.

Here’s what made this tricky. Raw video contains all sorts of noise: camera artifacts, inconsistent lighting, human movements that don’t generalize well. But simulation data is too clean—it’s perfect in ways that real execution never is. When you train a policy on both equally, it learns to expect a world that doesn’t exist.

Our approach? Modular decomposition. Instead of one monolithic policy, we broke the learning into stages:

  1. Extract core behaviors from human video using vision-language models (Claude’s multimodal capabilities proved invaluable here)
  2. Score simulation trajectories against these behaviors—keeping only trajectories that matched human-like decision patterns
  3. Layer modular policies that could be composed for different tasks
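The staged pipeline above can be sketched in a few lines. This is a toy illustration, not the project's actual code: the behavior embedding, the cosine-similarity scoring rule, and the threshold are all stand-ins for whatever the real feature extractor and matching criterion were.

```python
import numpy as np

def extract_behavior_embedding(frames):
    # Stand-in for stage 1: in the real pipeline a vision-language model
    # distills human video into a behavior representation. Here we just
    # average toy frame features.
    return np.mean(frames, axis=0)

def score_trajectory(traj, behavior_embedding):
    # Stand-in for stage 2's scoring rule: cosine similarity between a
    # simulated trajectory's feature summary and the human embedding.
    summary = np.mean(traj, axis=0)
    denom = np.linalg.norm(summary) * np.linalg.norm(behavior_embedding) + 1e-8
    return float(np.dot(summary, behavior_embedding) / denom)

def filter_trajectories(sim_trajs, behavior_embedding, threshold=0.5):
    # Keep only simulation rollouts that look human-like.
    return [t for t in sim_trajs
            if score_trajectory(t, behavior_embedding) >= threshold]

rng = np.random.default_rng(0)
human_frames = rng.normal(size=(30, 8)) + 1.0            # toy "video" features
behavior = extract_behavior_embedding(human_frames)
# Two human-like simulated rollouts and one that drifts the opposite way.
sim_trajs = [rng.normal(size=(30, 8)) + b for b in (1.0, 1.0, -1.0)]
kept = filter_trajectories(sim_trajs, behavior)
print(len(kept))  # only the human-like rollouts survive the filter
```

The point of the decomposition is that each stage can be swapped independently: a better embedding model improves stage 1 without touching the filter, and a stricter threshold tightens stage 2 without retraining anything upstream.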

The filtering stage was crucial. We used Claude to analyze video frames and extract the intent behind each action—not just the kinematics. A human reaching for something has context: they know where it is, why they need it, what obstacles exist. Raw simulation might generate the same trajectory, but without that reasoning backbone, the policy becomes brittle.
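What "extracting intent" might look like downstream of the model call: ask the VLM to answer in a structured format, then parse out the fields the filter needs. The prompt shape and field names below are assumptions; the post doesn't publish the actual prompt, and the model reply here is mocked rather than fetched from a real API.

```python
import json

# Hypothetical prompt shape for intent extraction (assumed, not from the post).
INTENT_PROMPT = (
    "For the action shown in these frames, answer as JSON with keys "
    "'goal', 'reason', and 'obstacles'."
)

def parse_intent(model_text):
    # Turn the model's JSON answer into the fields the filtering stage
    # compares against: what the human was doing, why, and what they avoided.
    data = json.loads(model_text)
    return {k: data.get(k) for k in ("goal", "reason", "obstacles")}

# Mocked reply standing in for a real multimodal API call.
mock_reply = (
    '{"goal": "pick up mug", "reason": "hand it to person", '
    '"obstacles": ["table edge"]}'
)
intent = parse_intent(mock_reply)
print(intent["goal"])
```

The value of the structured answer is exactly the "reasoning backbone" the paragraph describes: a raw trajectory has no `reason` field, so two kinematically identical trajectories can still be told apart by whether a plausible intent attaches to them.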

The tradeoff was real, though. By filtering aggressively, we reduced our training dataset significantly. More data would mean faster convergence, but noisier policies. We chose quality over quantity: better a robust policy trained on 500 carefully filtered trajectories than a confused one trained on 5,000 messy ones.
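The quality-over-quantity choice is just a top-k selection on the trajectory scores. The numbers below mirror the 500-of-5,000 ratio from the paragraph, but the scores are random stand-ins, not the project's data.

```python
import random

random.seed(42)

# Hypothetical human-likeness scores in [0, 1] for 5,000 simulated trajectories.
scores = [random.random() for _ in range(5000)]

# Aggressive filtering: keep only the 500 best-matching trajectories.
keep_n = 500
kept = sorted(scores, reverse=True)[:keep_n]

mean_all = sum(scores) / len(scores)
print(len(kept), min(kept) > mean_all)
```

An alternative to a fixed top-k is a fixed score threshold; top-k guarantees dataset size at the cost of a drifting quality bar, while a threshold guarantees quality at the cost of an unpredictable dataset size.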

One moment crystallized the value of this approach: our trained policy handled an unexpected obstacle smoothly, not by overfitting to video data, but because it had learned the reasoning behind human decisions. The policy understood why humans move certain ways, not just the mechanical how.

This work sits at the intersection of imitation learning, video understanding, and reinforcement learning—three domains that rarely talk to each other cleanly. By filtering simulation through human video understanding, we bridged that gap.

Tech fact: The term “distribution shift” describes exactly this problem—when training and deployment conditions differ. Video-to-simulation bridging is one elegant way to keep your policy honest.
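Distribution shift can be made concrete with a toy KL divergence: how "surprised" a policy trained under one outcome distribution is when deployment follows another. The two three-state distributions below are made up for illustration.

```python
import math

def kl_divergence(p, q):
    # KL(p || q): expected extra surprise when the world follows p
    # but the policy was trained to expect q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy outcome distributions over three states (nominal, perturbed, failure).
sim  = [0.70, 0.25, 0.05]   # clean simulation: mostly nominal states
real = [0.40, 0.35, 0.25]   # deployment: noise and edge cases are common

shift = kl_divergence(real, sim)
print(round(shift, 3))  # positive: deployment surprises the sim-trained policy
```

Filtering simulation against human video attacks exactly this gap: it pulls the training distribution toward the deployment one before the policy ever sees it.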

There are only 10 kinds of people in this world: those who understand simulation filtering and those who don’t. 😄

Metadata

Session ID:
grouped_trend-analisis_20260219_1829
Branch:
refactor/signal-trend-model
Dev Joke
What happens if Neovim gains consciousness? The first thing it does is delete its own documentation.
