NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning

NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weak spot in vision-language models (VLMs). These models still struggle to judge where objects are, how they relate, and how they move in 3D.

SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues that the interface is the bottleneck. Their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy. It outperforms the recent spatial agent SpaceTools by 11.2 points.

What is SpatialClaw

SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is pre-loaded with input frames and a set of primitives. Perception tools are plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python variables.

The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image into the agent’s next context. vlm dispatches queries to a separate VLM session. ReturnAnswer() submits the final answer.
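
To make the stateful-kernel idea concrete, here is a minimal sketch, not SpatialClaw's implementation. A toy kernel keeps one namespace alive across cells, so a variable produced in one step is still there in the next. ToyKernel, run_cell, and the stand-in frame data are hypothetical; only the names InputImages, Metadata, show, and ReturnAnswer come from the article.

```python
class ToyKernel:
    """Hypothetical stand-in for a stateful Python kernel (not the real SpatialClaw kernel)."""

    def __init__(self, frames, metadata):
        self.shown = []      # images queued for the agent's next context
        self.answer = None   # set once ReturnAnswer() is called
        # One namespace shared by every cell, so variables persist across steps.
        self.namespace = {
            "InputImages": frames,
            "Metadata": metadata,
            "show": self.shown.append,
            "ReturnAnswer": self._return_answer,
        }

    def _return_answer(self, value):
        self.answer = value

    def run_cell(self, source):
        """Execute one cell of agent-written code in the shared namespace."""
        exec(source, self.namespace)
        return self.answer is not None  # True once an answer was submitted


kernel = ToyKernel(frames=["frame0", "frame1", "frame2"], metadata={"fps": 2.0})

# Step 1 stores an intermediate result as an ordinary variable.
done_1 = kernel.run_cell("n_frames = len(InputImages)\nshow(InputImages[0])")
# Step 2 reuses that variable without recomputing it, then submits.
done_2 = kernel.run_cell("ReturnAnswer(n_frames / Metadata['fps'])")

print(done_1, done_2, kernel.answer, kernel.shown)
```

The point of the design is visible in step 2: n_frames was never passed anywhere, it simply stayed in the kernel between cells.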

Two perception tools are central. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework adds lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
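
As a hedged illustration of how a depth map and camera intrinsics relate to a point map, the sketch below back-projects pixels with the standard pinhole model. It is the textbook formula, not the code inside tools.Reconstruct. The function name, the intrinsics values, and the assumption that depth is measured along the optical axis (z-depth) are all illustrative.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn an H x W z-depth map (meters) into an H x W map of camera-frame (x, y, z) points.

    Assumes an ideal pinhole camera with no lens distortion; fx, fy, cx, cy are in pixels.
    """
    points = []
    for v, row in enumerate(depth):          # v = pixel row
        out_row = []
        for u, z in enumerate(row):          # u = pixel column
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        points.append(out_row)
    return points


# Tiny 2 x 3 depth map with made-up intrinsics.
depth = [[2.0, 2.0, 2.0],
         [4.0, 4.0, 4.0]]
point_map = backproject(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.0)

print(point_map[0][1])  # pixel at the principal point lies on the optical axis
print(point_map[1][2])  # one column right and one row down, at 4 m depth
```

Once every pixel has a 3D point, a segmentation mask selects the points that belong to one object, which is what makes metric distance queries possible.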

It is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.

https://spatialclaw.github.io/static/pdfs/spatialclaw.pdf

Why the Action Interface Matters

The research team studied three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.

Single-pass code writes one complete program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates straight to the answer.
Structured tool-call invokes named tools through a fixed JSON schema. It cannot freely combine outputs with NumPy or SciPy to express test-time computations. The closest-point operation has no pre-registered tool, so the result is wrong.
SpatialClaw composes tools in code, inspects results, then revises. It first computes a centroid distance, then notices that a centroid is an average over each object's points, not its nearest surface. The agent switches to scipy.spatial.KDTree to find the true closest point. It submits 0.9439 m against a 0.9 m ground truth.
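
The gap between a centroid distance and a closest-point distance is easy to see on toy data. The sketch below uses hand-made 2D points and a brute-force search in place of scipy.spatial.KDTree (same result, just slower on large point sets). The coordinates and helper names are hypothetical and unrelated to the heater-and-door sample.

```python
import math


def centroid(points):
    """Mean position of a set of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def closest_distance(points_a, points_b):
    """Smallest distance between any point in A and any point in B (brute force)."""
    return min(math.dist(a, b) for a in points_a for b in points_b)


# Two elongated objects whose nearest edges are 1 m apart along x.
object_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
object_b = [(3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

centroid_gap = math.dist(centroid(object_a), centroid(object_b))
closest_gap = closest_distance(object_a, object_b)

print(centroid_gap, closest_gap)  # 3.0 1.0
```

The centroid answer is three times too large here. An agent that can inspect the first number and swap in a different computation can recover, while a fixed tool schema has no slot for that correction.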

Benchmark

SpatialClaw was tested on 20 benchmarks across five categories. These span single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.

A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.

Action interface	Avg. (20 bench.)	Δ vs no-tool
No-tool baseline	53.4	–
Single-pass code	55.2	+1.8
Structured tool-call	56.7	+3.3
SpatialClaw (code as action)	59.9	+6.5

Gemma4-31B backbone, 20-benchmark average.

Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.

Method	Interface	Avg.	Δ vs SpatialClaw
VADAR	Single-pass	40.5*	−19.4
pySpatial	Single-pass	47.8	−12.1
SpaceTools-Toolshed	Structured tool-call	48.7	−11.2
SpatialClaw	Code as action	59.9	best

* VADAR does not support video or multi-image inputs; only single-image benchmarks are averaged.
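
For readers who want to check the delta columns, the short sketch below recomputes them from the averages quoted in the two tables above. The dictionary and variable names are my own; the scores are copied from the tables and nothing else is assumed.

```python
SPATIALCLAW = 59.9

# Interface ablation (Gemma4-31B, 20-benchmark average).
interface_ablation = {
    "No-tool baseline": 53.4,
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw": SPATIALCLAW,
}

# Prior spatial agents (VADAR is averaged over single-image benchmarks only).
prior_agents = {
    "VADAR": 40.5,
    "pySpatial": 47.8,
    "SpaceTools-Toolshed": 48.7,
}

baseline = interface_ablation["No-tool baseline"]
gain_over_baseline = {
    name: round(score - baseline, 1) for name, score in interface_ablation.items()
}
gap_to_spatialclaw = {
    name: round(score - SPATIALCLAW, 1) for name, score in prior_agents.items()
}

print(gain_over_baseline)
print(gap_to_spatialclaw)
```

The recomputed values match the published columns, including the 11.2-point margin over SpaceTools quoted in the introduction.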

The largest gains land on dynamic tasks. On Gemma4-31B, DSI-Bench rose 17.6 points and MindCube rose 15.3 points. These categories need chained geometric computation across frames and viewpoints.

An LLM-as-judge attribution explains the wins over structured tool-call. Code composition accounts for 52.2% of them. Control flow accounts for 19.5%, and the remaining 28.3% are interface-neutral.

Inside the 5-Stage Loop

Each sample runs a five-stage loop: planning, code generation, code execution, feedback assembly, and answer submission. A planner drafts a strategy without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps pass.
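
The control flow described above can be sketched in a few lines. This is an assumption-laden toy, not the LangGraph workflow from the repo. The banned-name lists, is_safe, run_loop, and the scripted cells that stand in for model output are all hypothetical, and a real checker would need a much stricter policy than this one.

```python
import ast

BANNED_MODULES = {"os", "subprocess", "socket"}          # illustrative policy only
BANNED_CALLS = {"eval", "exec", "open", "__import__"}    # illustrative policy only
MAX_STEPS = 30                                           # step cap quoted in the article


def is_safe(source):
    """Static check: parse the cell and reject banned imports or calls without running it."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BANNED_MODULES for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                return False
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
                return False
    return True


def run_loop(cells, max_steps=MAX_STEPS):
    """Run scripted cells in one shared namespace until an answer is submitted or steps run out."""
    result = {"answer": None, "rejected": 0, "steps": 0}
    namespace = {"ReturnAnswer": lambda value: result.update(answer=value)}
    for source in cells[:max_steps]:
        result["steps"] += 1
        if not is_safe(source):
            result["rejected"] += 1   # a real loop would feed the reason back to the agent
            continue
        exec(source, namespace)
        if result["answer"] is not None:
            break
    return result


scripted_cells = [
    "import os\nos.listdir('.')",      # rejected by the static check
    "distances = [1.4, 0.9, 2.2]",     # state kept for the next step
    "ReturnAnswer(min(distances))",    # submits and ends the loop
    "ReturnAnswer(-1)",                # never reached
]
outcome = run_loop(scripted_cells)
print(outcome)  # {'answer': 0.9, 'rejected': 1, 'steps': 3}
```

Checking the syntax tree before execution means a refused cell never touches the kernel state, so the agent can retry from exactly where it was.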

The official repo runs on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served through vLLM. Perception runs behind a FastAPI GPU service. A single quickstart runs one benchmark on one machine:

git clone --recursive https://github.com/NVlabs/SpatialClaw.git
cd SpatialClaw
bash spatial_agent/scripts/setup.sh
cp .env.example .env        # add API keys, or self-host vLLM
python -m spatial_agent.entrypoints.run \
    --dataset spatial_agent/config/dataset/erqa.json \
    --model   spatial_agent/config/model/gemini-3-pro.json \
    --concurrency 4

A representative agent cell composes perception with geometry, then revises:

import scipy.spatial  # explicit import in case the kernel has not preloaded it

# Reconstruct the scene, then segment both objects in a single video pass
recon = tools.Reconstruct.Reconstruct(InputImages)
seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
show(seg.visualize(1))                         # inspect the masks first

# Closest-point distance via KD-tree, not centroids
pts_h = seg.get_masked_points(recon, frame=1, object=0)   # object 0 = heater
pts_d = seg.get_masked_points(recon, frame=2, object=1)   # object 1 = door
dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)  # nearest door point for each heater point
ReturnAnswer(float(dists.min()))

The agent picks primitives from the question itself. Distance questions invoke KD-tree search and vector norms. Direction questions rely on dot products. No category-specific routing was applied.
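
As a small illustration of the direction case, the sketch below classifies where a target sits relative to a viewer with two dot products, plus a vector norm for the distance. The axis convention (forward and right unit vectors on the ground plane), the function name, and the coordinates are assumptions for the example, not something the paper specifies.

```python
import math


def relative_direction(viewer_pos, forward, right, target_pos):
    """Describe the target as front/behind and left/right of the viewer using dot products."""
    offset = tuple(t - v for t, v in zip(target_pos, viewer_pos))
    along_forward = sum(o * f for o, f in zip(offset, forward))   # > 0 means in front
    along_right = sum(o * r for o, r in zip(offset, right))       # > 0 means to the right
    depth_word = "front" if along_forward >= 0 else "behind"
    side_word = "right" if along_right >= 0 else "left"
    distance = math.sqrt(sum(o * o for o in offset))              # vector norm, in scene units
    return depth_word, side_word, distance


# Viewer at the origin looking along +y, with +x on its right (2D ground plane).
where = relative_direction(
    viewer_pos=(0.0, 0.0), forward=(0.0, 1.0), right=(1.0, 0.0), target_pos=(-3.0, 4.0)
)
print(where)  # ('front', 'left', 5.0)
```

In a real scene the forward and right vectors would come from the recovered camera extrinsics, which is why the same few lines cover questions asked from different viewpoints.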

Use Cases

The design fits problems that need step-by-step geometric reasoning. Concrete examples include:

Robotics and embodied agents that measure metric distances between objects before acting.
Multi-view inspection, where an object’s facing direction is recovered from multiple camera angles.
Video and 4D analysis that tracks object or camera motion across frames.
Indoor scene question answering, such as “where is the door relative to the sink?”

Because it is training-free, teams can extend a deployed VLM without new data or fine-tuning.

