

SAM 3D Body: Video-Driven Mocap Experiments

Testing SAM 3D Body on video to extract joints and mesh, and what it means for video-to-mocap workflows in Unreal Engine.

November 24, 2025 | 2 min read
Tags: SAM 3D Body, Motion Capture, Unreal Engine, 3D, Video2Mocap


SAM 3D Body is a promptable model for single-image full-body 3D human mesh recovery. It estimates pose and shape from one image, supports prompts like keypoints or masks, and outputs a structured human mesh representation that is animation-friendly.

This post documents a quick experiment: running SAM 3D Body frame-by-frame on a video to extract a mesh and joints, then thinking through what that enables for motion capture and video-to-mocap pipelines in Unreal Engine.
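To make the per-frame setup concrete, here is a minimal sketch of the loop I mean. The `SAM3DBodyPredictor` class, its `predict` method, and the `joints`/`vertices` fields are placeholder names rather than the actual SAM 3D Body API; only the shape of the pipeline is the point.

```python
import cv2
import numpy as np

# Placeholder import: the real SAM 3D Body package, class, and method names may differ.
from sam3d_body import SAM3DBodyPredictor

predictor = SAM3DBodyPredictor(checkpoint="sam3d_body.ckpt", device="cuda")

cap = cv2.VideoCapture("input_clip.mp4")
joints_per_frame = []   # one (J, 3) array of joint positions per frame
verts_per_frame = []    # one (V, 3) array of mesh vertices per frame

while True:
    ok, frame_bgr = cap.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # Single-image reconstruction: no temporal information is used here.
    result = predictor.predict(frame_rgb)
    joints_per_frame.append(result.joints)      # assumed shape (J, 3)
    verts_per_frame.append(result.vertices)     # assumed shape (V, 3)

cap.release()

# Stack into a (T, J, 3) track for the smoothing and retargeting steps below.
joints = np.stack(joints_per_frame)
```

Each frame costs a full forward pass, so for longer clips it is worth downsampling the frame rate or batching frames before worrying about anything fancier.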

Test clip: mesh + joints from video

The clip below is a test of extracting joints and a mesh from a video instead of a single image. Each frame is processed independently, then stitched into a short sequence for visualization.

What changes when you go from image to video

SAM 3D Body is trained for single-image reconstruction, so video inference is essentially a per-frame pipeline. The interesting part is the extra structure video gives you:

  • Temporal consistency: even a simple smoothing pass can reduce frame-to-frame jitter (a minimal sketch follows below).
  • Occlusion recovery: a limb hidden in one frame often appears in neighboring frames.
  • Motion cues: small motions help resolve ambiguous poses that are hard from one frame alone.

That does not make it a native video model, but it does make the output more usable for animation and tracking.
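As a first pass at the temporal-consistency point above, a plain exponential moving average over the stacked joint track already takes the edge off the jitter. This is only a sketch: the (T, J, 3) layout matches the per-frame loop above, and a one-euro filter or windowed optimization would be the more principled next step.

```python
import numpy as np

def smooth_joints(joints: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Exponential moving average over a (T, J, 3) joint track.

    Lower alpha means heavier smoothing (and more lag); alpha=1.0 returns
    the track unchanged.
    """
    smoothed = joints.astype(np.float64).copy()
    for t in range(1, len(smoothed)):
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

# joints has shape (T, J, 3), as stacked in the per-frame loop.
# joints_smoothed = smooth_joints(joints, alpha=0.6)
```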

Implications for motion capture + Unreal Engine motion matching

If you can recover a clean joint hierarchy and mesh per frame, the next steps are familiar:

  1. Stabilize the track -- apply temporal smoothing and fix joint flips.
  2. Retarget -- map the recovered joints to an Unreal Engine-compatible skeleton (see the mapping sketch after this list).
  3. Curate clips -- segment the sequence into motion-matching-ready clips.
  4. Blend and iterate -- feed it into Motion Matching, then tune blend spaces and cost weights.
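For step 2, the purely mechanical part of retargeting is renaming joints into the target skeleton's convention. The sketch below assumes an SMPL-style source naming, which is a guess about what SAM 3D Body emits, and maps it to standard UE5 Mannequin bone names; full retargeting also needs joint rotations and a rest pose, which this does not cover.

```python
# Source-side names assume an SMPL-style skeleton; adjust to whatever SAM 3D Body
# actually outputs. Target names are standard UE5 Mannequin bones.
JOINT_TO_UE_BONE = {
    "pelvis": "pelvis",
    "spine1": "spine_01", "spine2": "spine_02", "spine3": "spine_03",
    "neck": "neck_01", "head": "head",
    "left_hip": "thigh_l", "left_knee": "calf_l", "left_ankle": "foot_l",
    "right_hip": "thigh_r", "right_knee": "calf_r", "right_ankle": "foot_r",
    "left_shoulder": "upperarm_l", "left_elbow": "lowerarm_l", "left_wrist": "hand_l",
    "right_shoulder": "upperarm_r", "right_elbow": "lowerarm_r", "right_wrist": "hand_r",
}

def rename_joints(frame_joints: dict) -> dict:
    """Rename one frame's {joint_name: position} dict to UE bone names,
    dropping anything without a mapping."""
    return {
        JOINT_TO_UE_BONE[name]: pos
        for name, pos in frame_joints.items()
        if name in JOINT_TO_UE_BONE
    }
```

From there, a common route is to bake the renamed, smoothed track to FBX (or stream it over Live Link) and let Unreal's IK Retargeter handle the bone-space conversion.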

The big implication: lightweight video-to-mocap workflows become possible without a full marker-based capture setup. You can prototype moves, test blocking, or generate animation references directly from a phone video.

What I'm taking away

  • Single-image models become far more practical when you add temporal context.
  • The mesh output is useful even when you primarily need joints; it gives you pose sanity checks.
  • The Unreal Engine pipeline is viable as soon as the joint stream is stable.

Next experiments

  • Introduce temporal priors to reduce jitter without over-smoothing.
  • Compare a per-frame pipeline vs. a windowed optimization approach.
  • Test retargeting quality on fast turns and foot contacts.

References