Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

22 April 2026

Adriana Aida

Walida Amer

Katarina Bankovic

Dhruv Behl

Fabian Busch

Annie Bhalla

Minh Duong

Florian Gienger

Rohan Godse

Denis Grachev

Ralf Gulde

Elisa Hagensieker

Junpeng Hu

Shivam Joshi

Tobias Knoblauch

Likith Kumar

Damien LaRocque

Keerthana Lokesh

Omar Moured

Khiem Nguyen

Christian Preyss

Ranjith Sriganesan

Vikram Singh

Carsten Sponner

Anh Tong

Dominik Tuscher

Marc Tuscher

Pavan Upputuri

Main: 16 pages; 13 figures; Bibliography: 4 pages; 8 tables

Abstract

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive. By optimizing the next action given the current observation without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on a single-arm and dual-arm manipulation platform across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
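
The plan-and-act loop described in the abstract (generate candidate futures, score them for expected success and efficiency, commit to the highest-scoring one) can be illustrated with a minimal sketch. This is not the Cortex 2.0 implementation: the random-walk `rollout` stands in for a learned world model, `toy_success` and `toy_efficiency` are made-up scoring functions, and every name and parameter here (`Candidate`, `generate_candidates`, `select_best`, `efficiency_weight`) is hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List

Latent = List[float]


@dataclass
class Candidate:
    trajectory: List[Latent]  # sequence of latent states, start included
    success: float            # estimated success, higher is better
    efficiency: float         # estimated efficiency, higher is better

    def score(self, efficiency_weight: float = 0.1) -> float:
        # Assumed scoring rule: success plus a weighted efficiency term.
        return self.success + efficiency_weight * self.efficiency


def distance(a: Latent, b: Latent) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rollout(start: Latent, horizon: int, rng: random.Random) -> List[Latent]:
    # Toy stand-in for a world model: a Gaussian random walk in latent space.
    states = [list(start)]
    for _ in range(horizon):
        states.append([x + rng.gauss(0.0, 0.1) for x in states[-1]])
    return states


def toy_success(trajectory: List[Latent], goal: Latent) -> float:
    # Hypothetical success estimate: final state closer to the goal scores higher.
    return 1.0 / (1.0 + distance(trajectory[-1], goal))


def toy_efficiency(trajectory: List[Latent]) -> float:
    # Hypothetical efficiency estimate: shorter latent paths score higher.
    return -sum(distance(a, b) for a, b in zip(trajectory, trajectory[1:]))


def generate_candidates(start: Latent, goal: Latent, num_candidates: int = 8,
                        horizon: int = 5, seed: int = 0) -> List[Candidate]:
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_candidates):
        traj = rollout(start, horizon, rng)
        candidates.append(Candidate(traj, toy_success(traj, goal), toy_efficiency(traj)))
    return candidates


def select_best(candidates: List[Candidate]) -> Candidate:
    # Commit only to the highest-scoring candidate.
    return max(candidates, key=lambda c: c.score())


if __name__ == "__main__":
    cands = generate_candidates(start=[0.0, 0.0], goal=[1.0, 1.0])
    best = select_best(cands)
    print(f"best score: {best.score():.3f} over {len(cands)} candidates")
```

A reactive policy, in these terms, would correspond to executing a single rollout without comparing it against alternatives. The sketch differs only in that it evaluates several imagined futures before acting.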
