

SAM 3D Body: Video-Driven Mocap Experiments

Testing SAM 3D Body on video to extract joints and mesh, and what it means for video-to-mocap workflows in Unreal Engine.

November 24, 2025 | 2 min read
Tags: SAM 3D Body, Motion Capture, Unreal Engine, 3D, Video2Mocap


SAM 3D Body is a promptable model for single-image full-body 3D human mesh recovery. It estimates pose and shape from one image, supports prompts like keypoints or masks, and outputs a structured human mesh representation that is animation-friendly.

This post documents a quick experiment: running SAM 3D Body frame-by-frame on a video to extract a mesh and joints, then thinking through what that enables for motion capture and video-to-mocap pipelines in Unreal Engine.
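To make the per-frame setup concrete, here is a minimal sketch of the loop I mean. The `SAM3DBodyPredictor` class, its `predict` method, and the `joints`/`vertices` fields are placeholder names rather than the actual SAM 3D Body API; only the shape of the pipeline is the point.

```python
import cv2
import numpy as np

# Placeholder import: the real SAM 3D Body package, class, and method names may differ.
from sam3d_body import SAM3DBodyPredictor

predictor = SAM3DBodyPredictor(checkpoint="sam3d_body.ckpt", device="cuda")

cap = cv2.VideoCapture("input_clip.mp4")
joints_per_frame = []   # one (J, 3) array of joint positions per frame
verts_per_frame = []    # one (V, 3) array of mesh vertices per frame

while True:
    ok, frame_bgr = cap.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # Single-image reconstruction: no temporal information is used here.
    result = predictor.predict(frame_rgb)
    joints_per_frame.append(result.joints)      # assumed shape (J, 3)
    verts_per_frame.append(result.vertices)     # assumed shape (V, 3)

cap.release()

# Stack into a (T, J, 3) track for the smoothing and retargeting steps below.
joints = np.stack(joints_per_frame)
```

Each frame costs a full forward pass, so for longer clips it is worth downsampling the frame rate or batching frames before worrying about anything fancier.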

Test clip: mesh + joints from video

The clip below is a test of extracting joints and a mesh from a video instead of a single image. Each frame is processed independently, then stitched into a short sequence for visualization.

What changes when you go from image to video

SAM 3D Body is trained for single-image reconstruction, so video inference is essentially a per-frame pipeline. The interesting part is the extra structure video gives you:

  • Temporal consistency: even a simple smoothing pass can reduce frame-to-frame jitter (a minimal sketch follows below).
  • Occlusion recovery: a limb hidden in one frame often appears in neighboring frames.
  • Motion cues: small motions help resolve ambiguous poses that are hard from one frame alone.

That does not make it a native video model, but it does make the output more usable for animation and tracking.
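As a first pass at the temporal-consistency point above, a plain exponential moving average over the stacked joint track already takes the edge off the jitter. This is only a sketch: the (T, J, 3) layout matches the per-frame loop above, and a one-euro filter or windowed optimization would be the more principled next step.

```python
import numpy as np

def smooth_joints(joints: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Exponential moving average over a (T, J, 3) joint track.

    Lower alpha means heavier smoothing (and more lag); alpha=1.0 returns
    the track unchanged.
    """
    smoothed = joints.astype(np.float64).copy()
    for t in range(1, len(smoothed)):
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

# joints has shape (T, J, 3), as stacked in the per-frame loop.
# joints_smoothed = smooth_joints(joints, alpha=0.6)
```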

Implications for motion capture + Unreal Engine motion matching

If you can recover a clean joint hierarchy and mesh per frame, the next steps are familiar:

  1. Stabilize the track -- apply temporal smoothing and fix joint flips.
  2. Retarget -- map the recovered joints to an Unreal Engine-compatible skeleton (see the mapping sketch after this list).
  3. Curate clips -- segment the sequence into motion-matching-ready clips.
  4. Blend and iterate -- feed it into Motion Matching, then tune blend spaces and cost weights.
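For step 2, the purely mechanical part of retargeting is renaming joints into the target skeleton's convention. The sketch below assumes an SMPL-style source naming, which is a guess about what SAM 3D Body emits, and maps it to standard UE5 Mannequin bone names; full retargeting also needs joint rotations and a rest pose, which this does not cover.

```python
# Source-side names assume an SMPL-style skeleton; adjust to whatever SAM 3D Body
# actually outputs. Target names are standard UE5 Mannequin bones.
JOINT_TO_UE_BONE = {
    "pelvis": "pelvis",
    "spine1": "spine_01", "spine2": "spine_02", "spine3": "spine_03",
    "neck": "neck_01", "head": "head",
    "left_hip": "thigh_l", "left_knee": "calf_l", "left_ankle": "foot_l",
    "right_hip": "thigh_r", "right_knee": "calf_r", "right_ankle": "foot_r",
    "left_shoulder": "upperarm_l", "left_elbow": "lowerarm_l", "left_wrist": "hand_l",
    "right_shoulder": "upperarm_r", "right_elbow": "lowerarm_r", "right_wrist": "hand_r",
}

def rename_joints(frame_joints: dict) -> dict:
    """Rename one frame's {joint_name: position} dict to UE bone names,
    dropping anything without a mapping."""
    return {
        JOINT_TO_UE_BONE[name]: pos
        for name, pos in frame_joints.items()
        if name in JOINT_TO_UE_BONE
    }
```

From there, a common route is to bake the renamed, smoothed track to FBX (or stream it over Live Link) and let Unreal's IK Retargeter handle the bone-space conversion.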

The big implication: lightweight video-to-mocap workflows become possible without a full marker-based capture setup. You can prototype moves, test blocking, or generate animation references directly from a phone video.

What I'm taking away

  • Single-image models become far more practical when you add temporal context.
  • The mesh output is useful even when you primarily need joints; it gives you pose sanity checks.
  • The Unreal Engine pipeline is viable as soon as the joint stream is stable.

Next experiments

  • Introduce temporal priors to reduce jitter without over-smoothing.
  • Compare a per-frame pipeline vs. a windowed optimization approach.
  • Test retargeting quality on fast turns and foot contacts.

References