Supervisor Meeting · Mar 3, 2026

3D Scene Understanding
for Object Navigation

VGGT exploration results + thesis approach — augmenting VLAs with dense 3D semantic features
VGGT 3D Scene Understanding ObjectNav

UNITE

What Is UNITE?

UNITE Overview — Fig 1 from Koch et al. 2025
Fig 1 from Koch et al. (2025) — UNITE takes posed images and outputs a semantically rich 3D point cloud.

Takes a set of RGB images → reconstructs the full 3D scene → labels every 3D point with what it is, which object it belongs to, and how it moves.

Key Properties

  • Works from RGB only — no depth sensor needed
  • Single forward pass, a few seconds per scene
  • Outputs are language-queryable (CLIP-aligned)
  • Instance-level segmentation in 3D

Per-Point Output

  • 3D coordinates — geometry
  • CLIP features — open-vocabulary semantics
  • Instance ID — object grouping
  • Articulation vector — how parts move
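The per-point record above is what makes the output language-queryable: any CLIP text embedding can be matched against the per-point features by cosine similarity. A minimal sketch with mocked data (all names, shapes, and the feature dimension are assumptions, not the UNITE API):

```python
import numpy as np

# Mocked per-point output of a UNITE-style model for N points
N, D = 1000, 512                         # number of 3D points, CLIP dim (assumed)
rng = np.random.default_rng(0)
xyz = rng.normal(size=(N, 3))            # 3D coordinates (geometry)
clip_feats = rng.normal(size=(N, D))     # CLIP-aligned features (semantics)
instance_id = rng.integers(0, 20, N)     # object grouping
articulation = rng.normal(size=(N, 3))   # how parts move (e.g., motion axis)

def query_points(text_embedding, feats):
    """Rank points by cosine similarity to a CLIP text embedding."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    return f @ t                         # one similarity score per point

scores = query_points(rng.normal(size=D), clip_feats)
top = np.argsort(scores)[::-1][:10]      # 10 points most similar to the query
```

In a real pipeline the text embedding would come from the CLIP text encoder, and the instance IDs of the top points would identify the matching object.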

UNITE

How It Works

UNITE Architecture — Fig 2 from Koch et al. 2025
Fig 2 from Koch et al. (2025) — Architecture overview

1. Input

Set of RGB images from different viewpoints of the same scene

2. Backbone (VGGT)

Pre-trained vision transformer encodes images, fuses across views via attention → predicts 3D geometry

3. Semantic Heads (DPT)

Task-specific DPT heads on shared backbone predict semantics, instances, and articulation per point

4. Output

Dense 3D point cloud where each point has: coordinates, CLIP features, instance ID, articulation vector
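The four stages above can be sketched end-to-end with mocked components; every function here is a stand-in for illustration (not the real UNITE code), and the shapes are assumptions:

```python
import numpy as np

V, H, W, D = 4, 64, 64, 512   # views, image height/width, CLIP dim (assumed)

def backbone_vggt(images):
    # Stage 2: fuse views via attention, predict a 3D point per pixel (mocked)
    return np.zeros((V, H, W, 3))

def semantic_heads(images):
    # Stage 3: DPT heads predict semantics, instances, articulation (mocked)
    return (np.zeros((V, H, W, D)),              # dense CLIP features
            np.zeros((V, H, W), dtype=int),      # instance IDs
            np.zeros((V, H, W, 3)))              # articulation vectors

images = np.zeros((V, H, W, 3))                  # Stage 1: RGB views of one scene
points = backbone_vggt(images)
clip_f, inst, artic = semantic_heads(images)

# Stage 4: flatten to a dense labelled point cloud, one row per pixel
cloud_xyz = points.reshape(-1, 3)
cloud_clip = clip_f.reshape(-1, D)
```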

Foundation

VGGT: Visual Geometry Grounded Transformer CVPR 2025 Best Paper

VGGT Architecture
Fig 2 from Wang et al. — Alternating global/frame-wise attention (no frame index embedding needed).
Result Demo:

Parameter Breakdown (1.26B total)

Component      Params   Share
Aggregator     909M     72.4%
Camera Head    216M     17.2%
Depth Head      33M      2.6%
Point Head      33M      2.6%
Track Head      66M      5.2%

Inference: 2.37 s/frame (MPS) · Trained on 64× A100
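The share column follows directly from the raw counts; a quick sanity check using the rounded millions from the table (so the computed shares differ slightly from the slide's figures):

```python
params_m = {            # parameter counts in millions, from the table above
    "Aggregator": 909,
    "Camera Head": 216,
    "Depth Head": 33,
    "Point Head": 33,
    "Track Head": 66,
}
total = sum(params_m.values())                       # ≈ 1.26B in total
shares = {k: 100 * v / total for k, v in params_m.items()}
```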
VGGT depth + confidence maps
Depth (plasma) + confidence (viridis) — 5 views
VGGT PCA-RGB backbone features
PCA-RGB backbone features — 5 frames

Landscape

The Research Gap

Line A: VLAs with 3D

Several recent works add 3D to VLAs, but all use simple or implicit 3D:

  • SpatialVLA — 3D position encoding only (RSS 2025)
  • Spatial Forcing — implicit alignment, no 3D at inference (ICLR 2026)
  • 3D-VLA — point clouds but no semantics per point (ICML 2024)
  • eVGGT — distilled VGGT as frozen encoder for ACT/DP manipulation (Vuong et al. 2025)

Line B: Rich 3D Understanding

Foundation models for dense 3D scene understanding exist, but their semantic outputs have not yet been used to drive action:

  • UNITE — full semantic 3D from RGB (Koch et al. 2025)
  • VGGT — geometry backbone (CVPR 2025 Best Paper)

The Gap

eVGGT validates VGGT for robotics — but uses latent geometry only.

  • eVGGT: geometry-only, no semantics/CLIP/instance features
  • eVGGT: manipulation only, not navigation
  • UNITE's per-point semantics + instances never used for action
  • World models (DreamerV3) learn from 2D pixels only — never tested with rich 3D features
Proposed contribution: Rich 3D semantic features (VGGT/DPT geometry + CLIP semantics) have never been injected into world models. We test whether DreamerV3's RSSM benefits from dense 3D scene understanding for navigation.

Research Question

Can 3D scene understanding improve world models for navigation?

VGGT DPT Depth CLIP Semantics + DreamerV3 RSSM

Hypothesis

Replacing DreamerV3's CNN encoder with frozen VGGT + DPT features gives the world model an explicit geometric and semantic prior — leading to better sample efficiency and navigation success on HM3D ObjectNav.

Standard DreamerV3 learns all 3D structure implicitly from 2D pixels. We provide it directly via a frozen VGGT backbone.

Background

DreamerV3 — Recurrent State-Space Model (RSSM)

[RSSM diagram] The GRU sequence model computes hₜ = f(hₜ₋₁, zₜ₋₁, aₜ₋₁). The posterior samples zₜ ∼ q(zₜ | hₜ, xₜ) from the real observation xₜ; the prior samples ẑₜ ∼ p(ẑₜ | hₜ) without it. (hₜ, zₜ) feed the heads: decoder reconstruction x̂ₜ, reward r̂ₜ, continue ĉₜ. The actor-critic learns in imagination by rolling out the prior (recurrence with no observations), producing actions aₜ.

World Model (RSSM)

  • Deterministic — GRU carries long-range memory via \(h_t\)
  • Stochastic — categorical latent \(z_t\) (32 classes × 32 dims)
  • Posterior uses real observations; Prior imagines without them
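The posterior/prior split can be illustrated with a toy numpy sketch; this is not DreamerV3's implementation (which uses a real GRU and 32 categoricals of 32 classes), just the information flow with made-up linear maps:

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A, X = 8, 4, 2, 16   # toy sizes for h, z, action, observation

def gru_step(h, z, a):
    """Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) (toy linear stand-in)."""
    W = np.ones((H + Z + A, H)) / (H + Z + A)
    return np.tanh(np.concatenate([h, z, a]) @ W)

def prior_logits(h):
    # prior p(z_t | h_t): imagines without any observation
    return h[:Z]

def posterior_logits(h, x):
    # posterior q(z_t | h_t, x_t): corrected by the real observation
    return h[:Z] + x[:Z]

h = gru_step(np.zeros(H), np.zeros(Z), np.ones(A))
x = rng.normal(size=X)
z_post = int(np.argmax(posterior_logits(h, x)))   # one-hot categorical latent (toy)
z_prior = int(np.argmax(prior_logits(h)))
```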

Actor-Critic in Imagination

  • Unroll prior 15 steps — no environment needed
  • Actor maximises \(\lambda\)-returns over imagined trajectories
  • Critic predicts discounted value
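The λ-return over an imagined rollout is a backward recursion, bootstrapped from the critic's last value estimate; a minimal sketch (default coefficients are DreamerV3's usual λ=0.95, γ=0.997, but the function is generic):

```python
def lambda_returns(rewards, values, continues, lam=0.95, gamma=0.997):
    """R_t = r_t + γ c_t [(1-λ) v_{t+1} + λ R_{t+1}], bootstrapped with R_T = v_T.

    rewards, continues: length T; values: length T+1 (includes bootstrap value).
    """
    R = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * continues[t] * ((1 - lam) * values[t + 1] + lam * R)
        out.append(R)
    return out[::-1]
```

With λ=1 and γ=1 this reduces to the Monte-Carlo return plus the bootstrap value, which makes the recursion easy to check by hand.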

DreamerV3 Innovations

  • Symlog predictions — scale-free losses across tasks
  • Discrete latents — 32 × 32 categorical, trained with KL balancing and free bits
  • Unimix entropy — 1% uniform mix prevents posterior collapse
  • Single hyperparameter set works across 150+ tasks
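The symlog transform is what makes the losses scale-free: targets are squashed before prediction and un-squashed afterwards. A direct sketch of the pair of functions:

```python
import math

def symlog(x):
    """DreamerV3's target transform: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log(abs(x) + 1), x)

def symexp(x):
    """Inverse transform: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.exp(abs(x)) - 1, x)
```

Because symlog is symmetric and compresses large magnitudes, a reward of 1000 and a reward of 0.1 land on comparable loss scales without per-task tuning.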

Proposed Architecture

VGGT + DPT + DreamerV3 RSSM World Model

[Architecture diagram] An observation chunk of RGB frames [t−B+1 … t] (batch B≈16) is encoded by a frozen VGGT via batched multi-view 3D feature extraction. A modular, extensible DPT block provides geometric features (depth + world points) and, as a planned extension, dense CLIP features from a semantic DPT head. A fusion module combines geometry (⊕ semantics) into a conditioning vector cond. The RSSM world model keeps DreamerV3's defaults: GRU sequence model hₜ = f(hₜ₋₁, zₜ₋₁, aₜ₋₁); posterior zₜ ∼ q(zₜ | hₜ, cond) uses the observation, prior ẑₜ ∼ p(ẑₜ | hₜ) is used for imagination only; the decoder reconstructs cond. Reward r̂ₜ, continue ĉₜ, and value V(sₜ) heads support the actor-critic, which learns in imagination (no environment interaction) via the prior. An MLP actor π(aₜ | hₜ, zₜ) issues actions aₜ to the HM3D Habitat environment, which returns observations and reward.
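The fusion step that produces the conditioning vector can be sketched with mocked features; all shapes, the pooling choice, and the projection are assumptions for illustration, not the implemented module:

```python
import numpy as np

B, H, W = 16, 28, 28          # obs-chunk batch, feature-map size (assumed)
G, S, C = 128, 512, 256       # geometric dim, semantic dim, cond dim (assumed)

rng = np.random.default_rng(0)
geo = rng.normal(size=(B, H, W, G))     # depth / world-point features (geometric DPT)
sem = rng.normal(size=(B, H, W, S))     # dense CLIP features (planned semantic DPT)

def fuse(geo, sem):
    """Spatial mean-pool each stream, concatenate, project to the cond vector."""
    pooled = np.concatenate([geo.mean(axis=(1, 2)), sem.mean(axis=(1, 2))], axis=-1)
    W_proj = rng.normal(size=(G + S, C)) / np.sqrt(G + S)
    return pooled @ W_proj              # (B, C) conditioning for the RSSM posterior

cond = fuse(geo, sem)
```

Keeping the DPT block modular means the semantic stream can simply be dropped from the concatenation for geometry-only ablations.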

Encoder Selection

VGGT Variant Comparison

Variant        Peak mem @ 100 frames   @ 500 frames   Scaling
VGGT           39.4 GB                 OOM            O(N²)
SceneVGGT      39.4 GB                 OOM            O(N²)
FastVGGT       16.1 GB                 54.7 GB        O(N)
StreamVGGT     25.0 GB                 OOM            O(N)
InfiniteVGGT   16.6 GB                 18.0 GB        O(1)
VGGT latency and memory bars
Latency & peak memory at N ∈ {10, 50, 100}
DreamerV3 integration benchmark
DreamerV3 integration: 16 parallel envs, 20 RL steps (16 → 320 frames)
Result: Only InfiniteVGGT guarantees bounded memory at arbitrary episode lengths while maintaining stable throughput (~15 fps).
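The property being selected for is that memory stays constant as the episode grows. Purely as an illustration of O(1) scaling (this is not InfiniteVGGT's actual mechanism), a fixed-capacity frame cache behaves this way:

```python
from collections import deque

class BoundedCache:
    """Fixed-capacity cache: memory is O(1) in episode length."""
    def __init__(self, capacity=32):
        self.frames = deque(maxlen=capacity)   # oldest entries evicted first

    def add(self, frame):
        self.frames.append(frame)

    def __len__(self):
        return len(self.frames)

cache = BoundedCache(capacity=32)
for t in range(500):          # a full ObjectNav episode budget
    cache.add(t)
```

After 500 steps the cache still holds only the 32 most recent frames, so peak memory never grows with episode length.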

Encoder Selection

Output Quality — All Variants Equivalent

Depth maps across all VGGT variants
Depth maps from all five variants on identical input frames — outputs are visually indistinguishable.
Selection: InfiniteVGGT selected as 3D encoder — bounded memory, stable throughput, no quality loss.

Benchmark

ObjectNav Task

Task Definition

The agent spawns at a random pose in an unseen environment. Given a target object category (e.g., "chair"), it must navigate to any instance of that category and call STOP within 1.0 m.

Observations (Egocentric)

RGB
224×224 egocentric
Depth
224×224 egocentric
GPS+Compass
Relative to start pose
Goal
Category ID (1 of 6)

Action Space

MOVE_FORWARD 0.25m TURN_LEFT 30° TURN_RIGHT 30° STOP

Episode budget: 500 steps max. Success = STOP called within 1.0 m of any target instance.

Habitat ObjectNav Demo
Habitat ObjectNav — agent navigates to target object in photorealistic 3D scan

Key Metrics

Metric         Measures
Success Rate   Did the agent find the target?
SPL            Success weighted by path efficiency
SoftSPL        Progress toward goal (partial credit)
DTG            Distance to goal at episode end
No map, no oracle: Agent has zero prior knowledge of the environment. Must build spatial understanding from egocentric observations alone.
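SPL, the headline metric, weights binary success by how close the agent's path came to the shortest navigable path. A minimal sketch of the per-episode formula (Success weighted by Path Length):

```python
def spl(success, shortest_path, agent_path):
    """Per-episode SPL: S * l / max(p, l), where l is the geodesic shortest
    path from spawn to the nearest target and p is the path actually taken."""
    return float(success) * shortest_path / max(agent_path, shortest_path)
```

A successful episode that takes twice the shortest path scores 0.5; any failed episode scores 0, no matter how efficient the trajectory was.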

Benchmark

HM3D Environment & Episode Distribution

HM3D Dataset

  • 1,000 real-world 3D scans of residential spaces
  • Photorealistic rendering via Habitat simulator
  • 216 object categories with semantic annotations
  • Multi-room layouts: kitchens, bedrooms, bathrooms, living rooms, hallways
  • Standard benchmark since 2023 Habitat Challenge

6 Target Object Categories

🪑
Chair
🛏️
Bed
🪴
Plant
🚽
Toilet
📺
TV Monitor
🛋️
Sofa

What's in the Environment

Realistic clutter: furniture, appliances, decorations, doors, stairs. Scenes contain 50–300+ objects per scan. Agent must distinguish target from distractors in cluttered, multi-room layouts with varying lighting and occlusion.

Geodesic Distance Distribution

Shortest navigable path from spawn to nearest target instance (HM3D ObjectNav v2 val split)

[Histogram] Episode share vs. geodesic distance bucket (0–1, 1–2, 2–3, 3–5, 5–7, 7–10, 10–15, 15–20, 20+ m); peak at 3–5 m (multi-room navigation).
Why geodesic distance matters: Most episodes require navigating 3–7m through multiple rooms. This is where 3D spatial understanding becomes critical — the agent must reason about room connectivity and navigate around obstacles, not just recognize objects.

Baseline

DreamerV3 — Pixel-Only World Model

RGB (224×224)
CNN Encoder
RSSM World Model
Actor-Critic
Nav Action

DreamerV3 — Pixel-Only

Baseline · JAX
  • Standard CNN encoder on 2D RGB only
  • Learns geometry implicitly from pixels — no depth, no 3D, no semantics
  • State-of-the-art general RL agent (150+ tasks, Nature 2025)
  • First application to HM3D ObjectNav (no prior work)
  • Same RSSM architecture as our approach — only encoder differs
Our approach: keeps DreamerV3's RSSM unchanged but replaces the CNN encoder with frozen VGGT + DPT features — providing explicit 3D geometry and CLIP semantics that the pixel-only baseline lacks.
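The encoder swap can be sketched with mocks: the RSSM downstream is identical, only the embedding source changes. All names, shapes, and the split between geometry and semantics here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_encoder(rgb):
    """Baseline: learned 2D CNN features (mocked)."""
    return rng.normal(size=(rgb.shape[0], 1024))

def frozen_vggt_dpt_encoder(rgb):
    """Ours: frozen VGGT + DPT features, no encoder gradients (mocked)."""
    geometry = rng.normal(size=(rgb.shape[0], 512))   # depth / world points
    semantics = rng.normal(size=(rgb.shape[0], 512))  # dense CLIP features
    return np.concatenate([geometry, semantics], axis=-1)

obs = np.zeros((16, 224, 224, 3))     # batch of egocentric RGB frames
baseline_emb = cnn_encoder(obs)
ours_emb = frozen_vggt_dpt_encoder(obs)
```

Matching the embedding width keeps the rest of the DreamerV3 configuration untouched, so any performance difference is attributable to the encoder.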

Timeline

6-Month Execution Plan

Month 1

BWUniCluster Setup + Baseline

Environment setup on cluster · Start DreamerV3 baseline training · Compare VGGT variants

Month 2

VGGT Integration + Baseline Done

Integrate chosen VGGT variant into DreamerV3 encoder · Finish baseline training

Month 3

Full Pipeline + Semantic Head

Finish full pipeline of our approach · Start training custom semantic head

Month 4

Semantic Head Integration

Integrate new semantic head into pipeline · Ablation runs

Month 5

Iterate + Refine

Improve weakest components · Generalization tests · Buffer for reruns

Month 6

Writing

Thesis writing · Final analysis · Defense prep

GPU budget: ~90 A100-days needed · BWUniCluster available · DreamerV3 + VGGT training on A100