SAM 3D Body is a promptable model for single-image full-body 3D human mesh recovery. It estimates pose and shape from one image, supports prompts like keypoints or masks, and outputs a structured human mesh representation that is animation-friendly.
This post documents a quick experiment: running SAM 3D Body frame-by-frame on a video to extract a mesh and joints, then thinking through what that enables for motion capture and video-to-mocap pipelines in Unreal Engine.
Test clip: mesh + joints from video
The video below shows a test of extracting joints and a mesh from a video rather than a single image. Each frame is processed independently, then the results are stitched into a short sequence for visualization.
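In code, the loop is nothing exotic. Here is a minimal sketch of the per-frame pipeline, assuming a hypothetical Python wrapper around SAM 3D Body (`model.predict` is a placeholder name, not the released API):

```python
import cv2  # OpenCV, used only for video decoding

def extract_sequence(video_path, model):
    """Run single-image mesh recovery independently on every frame."""
    cap = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; most vision models expect RGB.
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        # Placeholder call: assume it returns mesh vertices + 3D joints.
        results.append(model.predict(frame_rgb))
    cap.release()
    return results  # one prediction per frame, no temporal coupling yet
```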
What changes when you go from image to video
SAM 3D Body is trained for single-image reconstruction, so video inference is essentially a per-frame pipeline. The interesting part is the extra structure video gives you:
- Temporal consistency: even a simple smoothing pass can reduce frame-to-frame jitter (a sketch follows this list).
- Occlusion recovery: a limb hidden in one frame often appears in neighboring frames.
- Motion cues: small motions help resolve ambiguous poses that are hard from one frame alone.
That does not make it a native video model, but it does make the output more usable for animation and tracking.
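As a concrete example of the "simple smoothing pass", here is a minimal exponential-moving-average sketch. It assumes the per-frame joints have been collected into a (T, J, 3) NumPy array, which is an assumption about the output layout, not documented behavior:

```python
import numpy as np

def ema_smooth(joints, alpha=0.6):
    """Exponential moving average over a (T, J, 3) joint track.

    alpha near 1.0 trusts the current frame (less smoothing, less lag);
    alpha near 0.0 smooths harder but lags behind fast motion.
    """
    joints = np.asarray(joints, dtype=np.float64)
    out = np.empty_like(joints)
    out[0] = joints[0]
    for t in range(1, len(joints)):
        out[t] = alpha * joints[t] + (1.0 - alpha) * out[t - 1]
    return out
```

Even this crude filter cuts jitter; the cost is lag on fast motion, which is what the adaptive filter sketched at the end of this post addresses.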
Implications for motion capture + Unreal Engine motion matching
If you can recover a clean joint hierarchy and mesh per frame, the next steps are familiar:
- Stabilize the track -- apply temporal smoothing and fix joint flips (a flip-fix sketch follows this list).
- Retarget -- map the recovered joints to an Unreal Engine-compatible skeleton.
- Curate clips -- segment the sequence into motion-matching-ready clips.
- Blend and iterate -- feed it into Motion Matching, then tune blend spaces and cost weights.
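The "fix joint flips" step deserves a sketch, because it bites everyone: a quaternion q and its negation -q encode the same rotation, so a per-frame model can flip sign between frames, and any downstream smoothing or interpolation then takes the long way around. This assumes rotations land in a (T, J, 4) quaternion array (again, an assumed layout):

```python
import numpy as np

def fix_quaternion_flips(quats):
    """Enforce sign continuity on a (T, J, 4) quaternion rotation track."""
    quats = np.asarray(quats, dtype=np.float64).copy()
    for t in range(1, len(quats)):
        # If a joint's quaternion points into the opposite hemisphere
        # from the (already fixed) previous frame, negate it.
        dots = np.sum(quats[t] * quats[t - 1], axis=-1, keepdims=True)
        quats[t] = np.where(dots < 0.0, -quats[t], quats[t])
    return quats
```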
The big implication: lightweight video-to-mocap workflows become possible without a full marker-based capture setup. You can prototype moves, test blocking, or generate animation references directly from a phone video.
What I'm taking away
- Single-image models become far more practical when you add temporal context.
- The mesh output is useful even when you primarily need joints; it gives you pose sanity checks.
- The Unreal Engine pipeline is viable as soon as the joint stream is stable.
Next experiments
- Introduce temporal priors to reduce jitter without over-smoothing (a candidate filter is sketched after this list).
- Compare a per-frame pipeline vs. a windowed optimization approach.
- Test retargeting quality on fast turns and foot contacts.
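For the first of these, the One Euro filter (Casiez et al., 2012) is the natural candidate: it smooths aggressively at low speeds, where jitter dominates, and relaxes at high speeds, where lag is the enemy. A minimal scalar version follows; in practice you would run one instance per joint coordinate:

```python
import math

class OneEuroFilter:
    """One Euro filter: adaptive low-pass whose cutoff rises with speed,
    trading smoothing for responsiveness exactly when motion is fast."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq              # sample rate in Hz (video frame rate)
        self.min_cutoff = min_cutoff  # baseline cutoff at rest
        self.beta = beta              # higher beta = less lag on fast moves
        self.d_cutoff = d_cutoff      # cutoff for the speed estimate itself
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor for a first-order low-pass at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Estimate speed, low-pass it, then adapt the cutoff to it.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```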