SIGMA

Static Scene Reconstruction via Inpainting and Geometry-first Motion Aggregation for Monocular RGB Videos

Carnegie Mellon University

TL;DR

SIGMA uses geometry-first motion aggregation to generate reliable masks without learned parameters, applies SDXL-based temporal inpainting to remove moving objects and hallucinate clean backgrounds in seconds, and employs sliding-window VGGT reconstruction to fuse cleaned frames into sharp, stable static 3D scenes. We introduce StaDy4D, a paired dynamic/static CARLA dataset for quantitative evaluation.

Overview

Pipeline figure: Dynamic Input → Stage 1 (Geometry-first Motion Masks) → Stage 2 (SDXL Inpainting) → Stage 3 (Sliding-Window VGGT) → 4D Reconstruction.

Dataset Gallery

Hover over the slider to compare Dynamic (with moving objects) vs Static (clean background) scenes.

Method

Stage 1: Geometry-first motion masks

  1. Extract SIFT keypoints from \(I_{t-1}, I_t\), match with FLANN + Lowe ratio, estimate \(F_{t-1,t}\) via RANSAC.
  2. Compute dense optical flow \(u_{t-1,t}(p)\), warp \(I_{t-1}\) to \(\tilde I_{t-1}\), measure photometric residual \(r_{\text{photo}}(p)\).
  3. Epipolar residual: \(r_{\text{epi}}(p) = \frac{|x_t^\top F_{t-1,t} x_{t-1}|}{\|(F x_{t-1})_{1:2}\|_2 + \|(F^\top x_t)_{1:2}\|_2 + \varepsilon}\).
  4. Fuse cues: \(r(p) = \alpha r_{\text{epi}} + (1-\alpha) r_{\text{photo}}\); threshold to obtain the motion mask \(M_t\), visualized as RGB overlays.
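Steps 3–4 can be sketched in NumPy, assuming the fundamental matrix \(F_{t-1,t}\) and matched points are already available from steps 1–2 (e.g. via OpenCV's SIFT + FLANN + RANSAC). The helper names `epipolar_residual` and `fuse_residuals`, and the values of \(\alpha\) and the threshold, are illustrative, not the paper's settings:

```python
import numpy as np

def epipolar_residual(F, x_prev, x_curr, eps=1e-8):
    """Step 3: normalized epipolar residual for homogeneous pixels (N, 3)."""
    Fx = x_prev @ F.T            # rows are F @ x_{t-1}
    Ftx = x_curr @ F             # rows are F^T @ x_t
    num = np.abs(np.sum(x_curr * Fx, axis=1))        # |x_t^T F x_{t-1}|
    den = (np.linalg.norm(Fx[:, :2], axis=1)
           + np.linalg.norm(Ftx[:, :2], axis=1) + eps)
    return num / den

def fuse_residuals(r_epi, r_photo, alpha=0.6, tau=0.05):
    """Step 4: r = alpha*r_epi + (1-alpha)*r_photo, thresholded to a mask.
    alpha and tau here are illustrative values, not the paper's."""
    r = alpha * r_epi + (1.0 - alpha) * r_photo
    return r > tau
```

For a sideways-translating camera, a pixel that slides along its epipolar line scores near zero, while a pixel that leaves the line (an independently moving object) scores high and is masked.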

Stage 2: Temporal inpainting

Use the previous clean estimate \(\hat I_{t-1}\) as guidance: \(\hat I_t = \text{Inp}(I_t, M_t^{\text{rect}}, \hat I_{t-1})\). Implemented with SDXL inpainting plus an SDXL-Lightning LoRA (8-step sampling); masks are rectangularized to reduce artifacts and preserve sky/ground seams. Results are shown as before/after triplets.
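The mask rectangularization step can be sketched as below: a minimal version that replaces the irregular motion mask with one padded bounding box (the actual implementation may operate per connected component). The function name `rectangularize_mask` and the padding default are illustrative; the output \(M_t^{\text{rect}}\) is what would be fed to the SDXL inpainting pipeline:

```python
import numpy as np

def rectangularize_mask(mask, pad=8):
    """Replace an irregular binary motion mask with a padded bounding box,
    giving the inpainter a clean rectangular hole (M_t^rect).
    `pad` (pixels) is an illustrative default, not the paper's setting."""
    rect = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:          # nothing moving in this frame
        return rect
    h, w = mask.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad + 1, h)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad + 1, w)
    rect[y0:y1, x0:x1] = 1
    return rect
```

A rectangular hole avoids ragged mask boundaries bleeding into the diffusion result, which is where sky/ground seam artifacts tend to appear.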

Stage 3: Streaming static reconstruction

Sliding-window VGGT: \(\{\mathbf{K}_t, \mathbf{R}_t, \mathbf{t}_t, \mathbf{D}_t, \mathcal{T}\}_{t=1}^T = \text{Recon}(\hat{\mathcal{I}}_{t-w:t})\). Per-window depths and tracks are fused into a single static point cloud/mesh, visualized as camera frusta over the fused cloud. Incremental fusion keeps memory low.
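The streaming loop can be sketched as below. Here `recon_fn` is a hypothetical stand-in for the VGGT forward pass, assumed to map a window of cleaned frames to an (N, 3) point array in a shared world frame; only one window of frames is held in memory at a time:

```python
import numpy as np

def streaming_reconstruct(frames, recon_fn, window=4, stride=2):
    """Run window-based reconstruction over a frame sequence and fuse the
    per-window point clouds. `recon_fn`, `window`, and `stride` are
    illustrative stand-ins, not the paper's actual model or settings."""
    fused = []
    for start in range(0, max(len(frames) - window + 1, 1), stride):
        pts = recon_fn(frames[start:start + window])  # (N, 3) world points
        fused.append(pts)       # only the current window is in memory
    return np.concatenate(fused, axis=0)
```

Overlapping windows (stride < window) give each frame multiple reconstruction passes, which helps stabilize the fused cloud across window boundaries.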

Data: StaDy4D

StaDy4D is generated directly inside CARLA. Every scripted run is rendered twice: once with dynamic agents and once with all movers removed, with camera intrinsics/extrinsics locked. We export RGB, depth, masks, poses, and intrinsics for each clip.
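A clip specification might look like the following. The schema and field names are hypothetical (illustrative only); the values come from the example scene and clip statistics on this page:

```python
# Hypothetical StaDy4D clip spec; field names are illustrative.
clip_spec = {
    "town": "Town03",
    "camera_path": "forward_turn",
    "num_pedestrians": 20,
    "num_vehicles": 10,
    "length_s": 15,
    "fps": 10,
    # each clip is rendered twice with identical camera parameters
    "passes": ["dynamic", "static"],
    # exported per frame
    "outputs": ["rgb", "depth", "mask", "pose", "intrinsics"],
}
n_frames = clip_spec["length_s"] * clip_spec["fps"]  # frames per pass
```

Keeping the camera parameters in a single spec, shared by both rendering passes, is what guarantees pixel-aligned dynamic/static pairs.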

Data Generation Pipeline

Data-generation figure: an input spec (scene: Town03; path: forward + turn; agents: 20 pedestrians, 10 cars) drives the CARLA engine, which dual-renders paired outputs — dynamic/static RGB, dynamic/static depth, and dynamic/static 3D point clouds.

Each scene is rendered twice with identical camera parameters: once with dynamic agents and once without.

Clip statistics

| #Scenes | #Clips | Length (s) | FPS | Pedestrians | Vehicles | Camera trajectories |
|---------|--------|------------|-----|-------------|----------|---------------------|
| 8       | 20     | 15         | 10  | 80          | 50       | 10                  |

Town coverage

| Town | Description      |
|------|------------------|
| T1   | Town near forest |
| T3   | Downtown area    |
| T4   | Small town       |
| T5   | Urban area       |
| T6   | Low-density town |
| T7   | Rural community  |
| T10  | Inner-city       |
| T12  | Large city map   |

Results

Pipeline Visualization

Intermediate and final outputs showing 2D image-space results and their 3D reconstructions.

Results grid (columns: Original Frame, Dynamic Mask, Inpainting, Ours, Ground Truth). The 2D row shows the original frame, dynamic mask, inpainted frame, and 2D ground truth; the 3D row shows the original frame, the inpainting-based reconstruction, our reconstruction, and 3D ground truth.

Interactive 3D Model

Orbit/zoom the fused static 3D reconstruction.

More Results

Citation

@article{tsui2025sigma,
  title={SIGMA: Static Scene Reconstruction via Inpainting and Geometry-first Motion Aggregation for Monocular RGB Videos},
  author={Henry Tsui* and Ethan Lai* and Yu-Rou Tuan*},
  journal={},
  year={2025}
}

Contact

Questions? Email henrytsui@email or open an issue on GitHub.