planning - 2025-06-13

The Urban Model Platform: A Public Backbone for Modeling and Simulation in Urban Digital Twins

Authors:Rico H Herzog, Till Degkwitz, Trivik Verma

Date:2025-06-12 17:58:10

Urban digital twins are increasingly perceived as a way to pool the growing digital resources of cities for the purpose of a more sustainable and integrated urban planning. Models and simulations are central to this undertaking: They enable "what if?" scenarios, create insights and describe relationships between the vast data that is being collected. However, the process of integrating and subsequently using models in urban digital twins is an inherently complex undertaking. It raises questions about how to represent urban complexity, how to deal with uncertain assumptions and modeling paradigms, and how to capture underlying power relations. Existent approaches in the domain largely focus on monolithic and centralized solutions in the tradition of neoliberal city-making, oftentimes prohibiting pluralistic and open interoperable models. Using a participatory design for participatory systems approach together with the City of Hamburg, Germany, we find that an open Urban Model Platform can function both as a public technological backbone for modeling and simulation in urban digital twins and as a socio-technical framework for a collaborative and pluralistic representation of urban processes. Such a platform builds on open standards, allows for a decentralized integration of models, enables communication between models and supports a multi-model approach to representing urban systems.

Agentic Semantic Control for Autonomous Wireless Space Networks: Extending Space-O-RAN with MCP-Driven Distributed Intelligence

Authors:Eduardo Baena, Paolo Testolina, Michele Polese, Sergi Aliaga, Andrew Benincasa, Dimitrios Koutsonikolas, Josep Jornet, Tommaso Melodia

Date:2025-06-12 17:35:36

Lunar surface operations impose stringent requirements on wireless communication systems, including autonomy, robustness to disruption, and the ability to adapt to environmental and mission-driven context. While Space-O-RAN provides a distributed orchestration model aligned with 3GPP standards, its decision logic is limited to static policies and lacks semantic integration. We propose a novel extension incorporating a semantic agentic layer enabled by the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication protocols, allowing context-aware decision making across real-time, near-real-time, and non-real-time control layers. Distributed cognitive agents deployed in rovers, landers, and lunar base stations implement wireless-aware coordination strategies, including delay-adaptive reasoning and bandwidth-aware semantic compression, while interacting with multiple MCP servers to reason over telemetry, locomotion planning, and mission constraints.

Dynamic Beyond 5G and 6G Connectivity: Leveraging NTN and RIS Synergies for Optimized Coverage and Capacity in High-Density Environments

Authors:Valdemar Farré, Juan Estrada, David Vega, Luis F Urquiza-Aguiar, Juan A. Vásquez Peralvo, Symeon Chatzinotas

Date:2025-06-12 17:08:27

The increasing demand for reliable, high-capacity communication during large-scale outdoor events poses significant challenges for traditional Terrestrial Networks (TNs), which often struggle to provide consistent coverage in high-density environments. This paper presents a novel 6G radio network planning framework that integrates Non-Terrestrial Networks (NTNs) with Reconfigurable Intelligent Surfaces (RISs) to deliver ubiquitous coverage and enhanced network capacity. Our framework overcomes the limitations of conventional deployable base stations by leveraging NTN architectures, including Low Earth Orbit (LEO) satellites and passive RIS platforms seamlessly integrated with Beyond 5G (B5G) TNs. By incorporating advanced B5G technologies such as Massive Multiple Input Multiple Output (mMIMO) and beamforming, and by optimizing spectrum utilization across the C, S, and Ka bands, we implement a rigorous interference management strategy based on a dynamic SINR model. Comprehensive calculations and simulations validate the proposed framework, demonstrating significant improvements in connectivity, reliability, and cost-efficiency in crowded scenarios. This integration strategy represents a promising solution for meeting the evolving demands of future 6G networks.

CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Authors:Alireza Salemi, Mukta Maddipatla, Hamed Zamani

Date:2025-06-12 16:02:29

This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

A Robust Optimization Framework for Flexible Industrial Energy Scheduling: Application to a Cement Plant with Market Participation

Authors:Sebastián Rojas-Innocenti, Enrique Baeyens, Alejandro Martín-Crespo, Sergio Saludes-Rodil, Fernando Frechoso Escudero

Date:2025-06-12 15:44:29

This paper presents a scenario based robust optimization framework for short term energy scheduling in electricity intensive industrial plants, explicitly addressing uncertainty in planning decisions. The model is formulated as a two-stage Mixed Integer Linear Program (MILP) and integrates a hybrid scenario generation method capable of representing uncertain inputs such as electricity prices, renewable generation, and internal demand. A convex objective function combining expected and worst case operational costs allows for tunable risk aversion, enabling planners to balance economic performance and robustness. The resulting schedule ensures feasibility across all scenarios and supports coordinated use of industrial flexibility assets, including battery energy storage and shiftable production. To isolate the effects of market volatility, the framework is applied to a real world cement manufacturing case study considering only day-ahead electricity price uncertainty, with all other inputs treated deterministically. Results show improved resilience to forecast deviations, reduced cost variability, and more consistent operations. The proposed method offers a scalable and risk-aware approach for industrial flexibility planning under uncertainty.
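The tunable risk aversion described above can be illustrated with a small sketch. It assumes the objective is a convex combination of expected and worst-case cost with a single weight; the function name `risk_weighted_cost`, the parameter `risk_aversion`, and the scenario numbers are hypothetical and not taken from the paper. The paper optimizes such an objective inside a two-stage MILP, whereas this sketch only evaluates it for one fixed schedule.

```python
from typing import Sequence


def risk_weighted_cost(
    costs: Sequence[float],
    probs: Sequence[float],
    risk_aversion: float,
) -> float:
    """Blend expected and worst-case scenario cost for one fixed schedule.

    risk_aversion = 0.0 gives the expected cost (risk neutral);
    risk_aversion = 1.0 gives the worst-case cost (fully robust).
    The weighting scheme is an assumption made for illustration.
    """
    if not costs or len(costs) != len(probs):
        raise ValueError("costs and probs must be non-empty and equally long")
    if not 0.0 <= risk_aversion <= 1.0:
        raise ValueError("risk_aversion must lie in [0, 1]")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")

    expected = sum(p * c for p, c in zip(probs, costs))
    worst = max(costs)
    return (1.0 - risk_aversion) * expected + risk_aversion * worst


# Hypothetical daily operating costs under three price scenarios.
scenario_costs = [100.0, 120.0, 200.0]
scenario_probs = [0.5, 0.25, 0.25]

risk_neutral = risk_weighted_cost(scenario_costs, scenario_probs, 0.0)
balanced = risk_weighted_cost(scenario_costs, scenario_probs, 0.5)
fully_robust = risk_weighted_cost(scenario_costs, scenario_probs, 1.0)
```

Moving the weight from 0 toward 1 shifts the evaluation from average performance toward protection against the most expensive scenario, which is the trade-off the abstract attributes to the planner.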

Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation

Authors:Nicholas Summerfield, Qisheng He, Alex Kuo, Ahmed I. Ghanem, Simeng Zhu, Chase Ruff, Joshua Pan, Anudeep Kumar, Prashant Nagpal, Jiwei Zhao, Ming Dong, Carri K. Glide-Hurst

Date:2025-06-12 15:10:24

Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net-based, U-shaped backbone conserving the function of a single model. Twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT angiography (CCTA) modalities were manually delineated and used to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison models (four segmentation subgroups across three modalities) were equivalently trained. All methods were compared for training efficiency and against reference contours using the Dice Similarity Coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC outperforms the comparison in 57% of cases, with limited statistical differences. MAGIC offers an effective and accurate segmentation solution that is lightweight and capable of segmenting multiple modalities and overlapping structures in a single model. MAGIC further enables clinical implementation by simplifying the computational requirements and offering unparalleled flexibility for clinical settings.
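For readers unfamiliar with the Dice Similarity Coefficient used in the evaluation above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), applied to two sets of voxel coordinates. The function name, the voxel sets, and the convention of returning 1.0 for two empty masks are assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_coefficient(predicted: Set[Voxel], reference: Set[Voxel]) -> float:
    """Dice Similarity Coefficient between two voxel sets.

    Returns 1.0 when both sets are empty (an assumed convention).
    """
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


# Hypothetical contours: four voxels each, three of them shared.
reference_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
predicted_mask = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}

score = dice_coefficient(predicted_mask, reference_mask)
```

A score of 1.0 means the predicted and reference contours coincide exactly, and 0.0 means they do not overlap at all.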

Sampling-Based Planning Under STL Specifications: A Forward Invariance Approach

Authors:Gregorio Marchesini, Siyuan Liu, Lars Lindemann, Dimos V. Dimarogonas

Date:2025-06-12 14:27:35

We propose a variant of the Rapidly Exploring Random Tree Star (RRT$^{\star}$) algorithm to synthesize trajectories satisfying a given spatio-temporal specification expressed in a fragment of Signal Temporal Logic (STL) for linear systems. Previous approaches for planning trajectories under STL specifications using sampling-based methods leverage either mixed-integer or non-smooth optimization techniques, with poor scalability in the horizon and complexity of the task. We adopt instead a control-theoretic perspective on the problem, based on the notion of set forward invariance. Specifically, from a given STL task defined over polyhedral predicates, we develop a novel algorithmic framework by which the task is efficiently encoded into a time-varying set via linear programming, such that trajectories evolving within the set also satisfy the task. Forward invariance properties of the resulting set with respect to the system dynamics and input limitations are then proved via non-smooth analysis. We then present a modified RRT$^{\star}$ algorithm to synthesize asymptotically optimal and dynamically feasible trajectories satisfying a given STL specification, by sampling a tree of trajectories within the previously constructed time-varying set. We showcase two use cases of our approach involving an autonomous inspection of the International Space Station and a room-servicing task requiring timed revisit of a charging station.

Deep Learning-based Multi Project InP Wafer Simulation for Unsupervised Surface Defect Detection

Authors:Emílio Dolgener Cantú, Rolf Klemens Wittmann, Oliver Abdeen, Patrick Wagner, Wojciech Samek, Moritz Baier, Sebastian Lapuschkin

Date:2025-06-12 14:03:10

Quality management in semiconductor manufacturing often relies on template matching with known golden standards. For Indium-Phosphide (InP) multi-project wafer manufacturing, low production scale and high design variability lead to such golden standards being typically unavailable. Defect detection, in turn, is manual and labor-intensive. This work addresses this challenge by proposing a methodology to generate a synthetic golden standard using Deep Neural Networks, trained to simulate photo-realistic InP wafer images from CAD data. We evaluate various training objectives and assess the quality of the simulated images on both synthetic data and InP wafer photographs. Our deep-learning-based method outperforms a baseline decision-tree-based approach, enabling the use of a 'simulated golden die' from CAD plans in any user-defined region of a wafer for more efficient defect detection. We apply our method to a template matching procedure, to demonstrate its practical utility in surface defect detection.

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Authors:Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie

Date:2025-06-12 06:21:19

Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/

NeuroPAL: Punctuated Anytime Learning with Neuroevolution for Macromanagement in Starcraft: Brood War

Authors:Jim O'Connor, Yeonghun Lee, Gary B Parker

Date:2025-06-12 06:19:27

StarCraft: Brood War remains a challenging benchmark for artificial intelligence research, particularly in the domain of macromanagement, where long-term strategic planning is required. Traditional approaches to StarCraft AI rely on rule-based systems or supervised deep learning, both of which face limitations in adaptability and computational efficiency. In this work, we introduce NeuroPAL, a neuroevolutionary framework that integrates Neuroevolution of Augmenting Topologies (NEAT) with Punctuated Anytime Learning (PAL) to improve the efficiency of evolutionary training. By alternating between frequent, low-fidelity training and periodic, high-fidelity evaluations, PAL enhances the sample efficiency of NEAT, enabling agents to discover effective strategies in fewer training iterations. We evaluate NeuroPAL in a fixed-map, single-race scenario in StarCraft: Brood War and compare its performance to standard NEAT-based training. Our results show that PAL significantly accelerates the learning process, allowing the agent to reach competitive levels of play in approximately half the training time required by NEAT alone. Additionally, the evolved agents exhibit emergent behaviors such as proxy barracks placement and defensive building optimization, strategies commonly used by expert human players. These findings suggest that structured evaluation mechanisms like PAL can enhance the scalability and effectiveness of neuroevolution in complex real-time strategy environments.
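The alternation between frequent low-fidelity training and periodic high-fidelity evaluation can be pictured as a simple schedule. The sketch below is only that: it assumes one low-fidelity pass per generation and one high-fidelity evaluation every fixed number of generations. The function name `pal_schedule`, the parameter `high_fidelity_period`, and the fixed period itself are hypothetical; the abstract does not state how NeuroPAL spaces its evaluations.

```python
from typing import List


def pal_schedule(generations: int, high_fidelity_period: int) -> List[str]:
    """List the evaluation phases run over a number of generations.

    Every generation runs a cheap "low" fidelity pass; every
    high_fidelity_period-th generation is followed by a costly "high"
    fidelity evaluation. The fixed period is an illustrative assumption.
    """
    if generations < 0 or high_fidelity_period < 1:
        raise ValueError("generations must be >= 0 and the period >= 1")

    phases: List[str] = []
    for generation in range(1, generations + 1):
        phases.append("low")
        if generation % high_fidelity_period == 0:
            phases.append("high")
    return phases


# Six generations with a high-fidelity evaluation after every third one.
schedule = pal_schedule(generations=6, high_fidelity_period=3)
```

Under this assumption the number of costly evaluations grows with `generations // high_fidelity_period` rather than with the number of generations, which is one way to read the efficiency argument in the abstract.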

Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

Authors:Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie

Date:2025-06-12 05:29:40

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging because of insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/

New Approximation Guarantees for The Inventory Staggering Problem

Authors:Noga Alon, Danny Segev

Date:2025-06-12 04:29:43

Since its inception in the mid-60s, the inventory staggering problem has been explored and exploited in a wide range of application domains, such as production planning, stock control systems, warehousing, and aerospace/defense logistics. However, even with a rich history of academic focus, we are still very much in the dark when it comes to cornerstone computational questions around inventory staggering and to related structural characterizations, with our methodological toolbox being severely under-stocked. The central contribution of this paper consists in devising a host of algorithmic techniques and analytical ideas -- some being entirely novel and some leveraging well-studied concepts in combinatorics and number theory -- for surpassing essentially all known approximation guarantees for the inventory staggering problem. In particular, our work demonstrates that numerous structural properties open the door for designing polynomial-time approximation schemes, including polynomially-bounded cycle lengths, constantly-many distinct time intervals, so-called nested instances, and pairwise coprime settings. These findings offer substantial improvements over currently available constant-factor approximations and resolve outstanding open questions in their respective contexts. In parallel, we develop new theory around a number of yet-uncharted questions, related to the sampling complexity of peak inventory estimation as well as to the plausibility of groupwise synchronization. Interestingly, we establish the global nature of inventory staggering, proving that there are $n$-item instances where, for every subset of roughly $\sqrt{n}$ items, no policy improves on the worst-possible one by a factor greater than $1+\epsilon$, whereas for the entire instance, there exists a policy that outperforms the worst-possible one by a factor of nearly $2$, which is optimal.

Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework

Authors:Sadia Kamal, Tim Oates, Joy Wan

Date:2025-06-12 03:33:46

Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.

Dynamic Less-Than-Truckload Transportation Planning in Hyperconnected Hub Networks with Multi-Carrier Operations

Authors:Tiankuo Zhang, Jingze Li, Benoit Montreuil

Date:2025-06-12 02:04:35

Less-than-truckload (LTL) shipment is vital in modern freight transportation yet is in dire need of more efficient usage of resources, higher service responsiveness and velocity, lower overall shipping cost across all parties, and better quality of life for the drivers. The industry is currently highly fragmented, with numerous small to medium-sized LTL carriers typically operating within dedicated regions or corridors, mostly disconnected from each other. This paper investigates the large-scale interconnection of LTL carriers enabling each to leverage multi-carrier networks for cross-region services exploiting their mutual logistic hubs, in line with Physical Internet principles. In such a network, efficient open cooperation strategies are critical for optimizing multiparty relay shipment consolidation and delivery, transport and logistic operations and orchestration, and enabling inter-hub driver short hauls. To dynamically plan relay truck transportation of involved carriers across hyperconnected hub networks, we develop an optimization-based model to build loads, coordinate shipments, and synchronize driver deliveries. We report a simulation-based experiment in a multiparty LTL network covering the eastern U.S. in three scenarios: 1) each carrier operates separately and serves its clients with end-to-end transportation, 2) each carrier operates separately and adopts relay transportation in its service region, and 3) all carriers operate jointly and serve clients in the multi-carrier hyperconnected relay network. By comparing these three scenarios, we evaluate the impact of relay transportation and carrier cooperations on cost savings, trip duration, and greenhouse gas emissions. Overall, this research advances operational efficiencies through an effective collaborative solution across the LTL industry and contributes to the pursuit of sustainable logistics networks.

A Navigation Framework Utilizing Vision-Language Models

Authors:Yicheng Duan, Kaiyu Tang

Date:2025-06-11 20:51:58

Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.

Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective

Authors:Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng

Date:2025-06-11 19:44:51

Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. Current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours, and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39% to 7.72%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Authors:Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

Date:2025-06-11 19:36:17

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained to play the puzzle game Sokoban with model-free reinforcement learning. Prior work found that this network solves more levels with more test-time compute. Our analysis reveals several mechanisms analogous to components of classic bidirectional search. For each square, the RNN represents its plan in the activations of channels associated with specific directions. These state-action activations are analogous to a value function - their magnitudes determine when to backtrack and which plan branch survives pruning. Specialized kernels extend these activations (containing plan and value) forward and backward to create paths, forming a transition model. The algorithm is also unlike classical search in some ways. State representation is not unified; instead, the network considers each box separately. Each layer has its own plan representation and value function, increasing search depth. Far from being inscrutable, the mechanisms leveraging test-time compute learned in this network by model-free training can be understood in familiar terms.

Patient-Specific Deep Reinforcement Learning for Automatic Replanning in Head-and-Neck Cancer Proton Therapy

Authors:Malvern Madondo, Yuan Shao, Yingzi Liu, Jun Zhou, Xiaofeng Yang, Zhen Tian

Date:2025-06-11 18:00:06

Anatomical changes during intensity-modulated proton therapy (IMPT) for head-and-neck cancer (HNC) can shift Bragg peaks, risking tumor underdosing and organ-at-risk overdosing. As a result, treatment replanning is often required to maintain clinically acceptable treatment quality. However, current manual replanning processes are resource-intensive and time-consuming. We propose a patient-specific deep reinforcement learning (DRL) framework for automated IMPT replanning, with a reward-shaping mechanism based on a $150$-point plan quality score addressing competing clinical objectives. We formulate the planning process as an RL problem where agents learn control policies to adjust optimization priorities, maximizing plan quality. Unlike population-based approaches, our framework trains personalized agents for each patient using their planning CT (Computed Tomography) and augmented anatomies simulating anatomical changes (tumor progression and regression). This patient-specific approach leverages anatomical similarities throughout treatment, enabling effective plan adaptation. We implemented two DRL algorithms, Deep Q-Network and Proximal Policy Optimization, using dose-volume histograms (DVHs) as state representations and a $22$-dimensional action space of priority adjustments. Evaluation on five HNC patients using actual replanning CT data showed both DRL agents improved initial plan scores from $120.63 \pm 21.40$ to $139.78 \pm 6.84$ (DQN) and $142.74 \pm 5.16$ (PPO), surpassing manual replans generated by a human planner ($137.20 \pm 5.58$). Clinical validation confirms that improvements translate to better tumor coverage and OAR sparing across diverse anatomical changes. This work demonstrates DRL's potential in addressing geometric and dosimetric complexities of adaptive proton therapy, offering efficient offline adaptation solutions and advancing online adaptive proton therapy.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Authors:Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas

Date:2025-06-11 17:57:09

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

ReSim: Reliable World Simulation for Autonomous Driving

Authors:Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen

Date:2025-06-11 17:55:05

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

Authors:Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, Justine T. Kao

Date:2025-06-11 17:10:36

We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.

From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Authors:Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

Date:2025-06-11 16:52:18

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

From Theory to Practice: Advancing Multi-Robot Path Planning Algorithms and Applications

Authors:Teng Guo

Date:2025-06-11 16:29:42

The labeled MRPP (Multi-Robot Path Planning) problem involves routing robots from start to goal configurations efficiently while avoiding collisions. Despite progress in solution quality and runtime, its complexity and industrial relevance continue to drive research. This dissertation introduces scalable MRPP methods with provable guarantees and practical heuristics. First, we study dense MRPP on 2D grids, relevant to warehouse and parcel systems. We propose the Rubik Table method, achieving $(1 + \delta)$-optimal makespan (with $\delta \in (0, 0.5]$) for up to $\frac{m_1 m_2}{2}$ robots, solving large instances efficiently and setting a new theoretical benchmark. Next, we address real-world MRPP. We design optimal layouts for structured environments (e.g., warehouses, parking systems) and propose a puzzle-based system for dense, deadlock-free autonomous vehicle parking. We also extend MRPP to Reeds-Shepp robots, introducing motion primitives and smoothing techniques to ensure feasible, efficient paths under nonholonomic constraints. Simulations and real-world tests validate the approach in urban driving and robotic transport scenarios.

Entanglement structure for finite system under dual-unitary dynamics

Authors:Gaurav Rudra Malik, Rohit Kumar Shukla, Sudhanva Joshi, S. Aravinda, Sunil Kumar Mishra

Date:2025-06-11 16:17:17

The dynamics of quantum many-body systems in the chaotic regime are of particular interest due to the associated phenomena of information scrambling and entanglement generation within the system. While these systems are typically intractable using traditional numerical methods, an effective framework can be implemented based on dual-unitary circuits which have emerged as a minimal model for maximally chaotic dynamics. In this work, we investigate how individual two-body operators influence the global dynamics of circuits composed of dual-unitaries. We study their effect on entanglement generation while examining it from both bipartite and multipartite perspectives. Here we also highlight the significant role of local unitaries in the dynamics when paired with operators from the dual-unitary class, showing that systems with identical entangling power can exhibit a range of differing entanglement growth rates. Furthermore, we present calculations establishing time-step-dependent lower bounds, which depend on both the initial state and the entangling power of the constituent operators. Finally, we find that time-evolving an initial state composed of pair products generates a state with nearly maximal multipartite entanglement content, approaching the bounds established by Absolutely Maximally Entangled (AME) states.

"What are my options?": Explaining RL Agents with Diverse Near-Optimal Alternatives (Extended)

Authors:Noel Brindise, Vijeth Hebbar, Riya Shah, Cedric Langbort

Date:2025-06-11 16:15:56

In this work, we provide an extended discussion of a new approach to explainable Reinforcement Learning called Diverse Near-Optimal Alternatives (DNA), first proposed at L4DC 2025. DNA seeks a set of reasonable "options" for trajectory-planning agents, optimizing policies to produce qualitatively diverse trajectories in Euclidean space. In the spirit of explainability, these distinct policies are used to "explain" an agent's options in terms of available trajectory shapes from which a human user may choose. In particular, DNA applies to value function-based policies on Markov decision processes where agents are limited to continuous trajectories. Here, we describe DNA, which uses reward shaping in local, modified Q-learning problems to solve for distinct policies with guaranteed epsilon-optimality. We show that it successfully returns qualitatively different policies that constitute meaningfully different "options" in simulation, including a brief comparison to related approaches in the stochastic optimization field of Quality Diversity. Beyond the explanatory motivation, this work opens new possibilities for exploration and adaptive planning in RL.

Hierarchical Learning-Enhanced MPC for Safe Crowd Navigation with Heterogeneous Constraints

Authors:Huajian Liu, Yixuan Feng, Wei Dong, Kunpeng Fan, Chao Wang, Yongzhuo Gao

Date:2025-06-11 15:31:25

In this paper, we propose a novel hierarchical framework for robot navigation in dynamic environments with heterogeneous constraints. Our approach leverages a graph neural network trained via reinforcement learning (RL) to efficiently estimate the robot's cost-to-go, formulated as local goal recommendations. A spatio-temporal path-searching module, which accounts for kinematic constraints, is then employed to generate a reference trajectory to facilitate solving the non-convex optimization problem used for explicit constraint enforcement. More importantly, we introduce an incremental action-masking mechanism and a privileged learning strategy, enabling end-to-end training of the proposed planner. Both simulation and real-world experiments demonstrate that the proposed method effectively addresses local planning in complex dynamic environments, achieving state-of-the-art (SOTA) performance. Compared with existing learning-optimization hybrid methods, our approach eliminates the dependency on high-fidelity simulation environments, offering significant advantages in computational efficiency and training scalability. The code will be released as open-source upon acceptance of the paper.

Noise in Maps of the Sun at Radio Wavelengths II: Solar Use Cases

Authors:Timothy Bastian, Bin Chen, Surajit Mondal, Pascal Saint-Hilaire

Date:2025-06-11 15:19:41

Noise in images of strong celestial sources at radio wavelengths using Fourier synthesis arrays can be dominated by the source itself, so-called self-noise. We outlined the theory of self-noise for strong sources in a companion paper. Here we consider the case of noise in maps of radio emission from the Sun, which, as we show, is always dominated by self-noise. We consider several classes of science use cases for current and planned arrays designed to observe the Sun in order to understand limitations imposed by self-noise. We focus on instruments operating at decimeter and centimeter wavelengths but the results are applicable to other wavelength regimes.

Automatic Treatment Planning using Reinforcement Learning for High-dose-rate Prostate Brachytherapy

Authors:Tonghe Wang, Yining Feng, Xiaofeng Yang

Date:2025-06-11 14:46:42

Purpose: In high-dose-rate (HDR) prostate brachytherapy procedures, the pattern of needle placement relies solely on physician experience. We investigated the feasibility of using reinforcement learning (RL) to provide needle positions and dwell times based on patient anatomy during the pre-planning stage. This approach would reduce procedure time and ensure consistent plan quality. Materials and Methods: We train an RL agent to adjust the position of one selected needle and all the dwell times on it to maximize a pre-defined reward function after observing the environment. After adjusting, the RL agent then moves on to the next needle, until all needles are adjusted. Multiple rounds are played by the agent until the maximum number of rounds is reached. Plan data from 11 prostate HDR boost patients (1 for training, and 10 for testing) treated in our clinic were included in this study. The dosimetric metrics and the number of needles used in the RL plans were compared to those of the clinical results (ground truth). Results: On average, RL plans and clinical plans have very similar prostate coverage (Prostate V100) and Rectum D2cc (no statistical significance), while RL plans have lower prostate hotspot (Prostate V150) and Urethra D20% values, with statistical significance. Moreover, RL plans use 2 fewer needles than clinical plans on average. Conclusion: We present the first study demonstrating the feasibility of using reinforcement learning to autonomously generate clinically practical HDR prostate brachytherapy plans. This RL-based method achieved equal or improved plan quality compared to conventional clinical approaches while requiring fewer needles. With minimal data requirements and strong generalizability, this approach has substantial potential to standardize brachytherapy planning, reduce clinical variability, and enhance patient outcomes.
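
The needle-by-needle, multi-round adjustment loop described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the learned RL agent is replaced by a plain greedy search over a few candidate adjustments, and the reward, the candidate moves, the target values, and all names (`adjust_plan`, `toy_reward`, `toy_candidates`) are hypothetical stand-ins chosen only to show the control flow (one needle at a time, repeated for a fixed number of rounds).

```python
from typing import Callable, List, Tuple

# Toy stand-in for a needle: (position index, dwell time). Hypothetical.
Needle = Tuple[int, float]


def adjust_plan(
    needles: List[Needle],
    reward: Callable[[List[Needle]], float],
    candidates: Callable[[Needle], List[Needle]],
    max_rounds: int,
) -> List[Needle]:
    """Adjust one needle at a time, for a fixed number of rounds.

    A greedy search stands in for the trained RL policy: for each needle,
    keep the candidate adjustment that most increases the reward.
    """
    plan = list(needles)
    for _ in range(max_rounds):
        for i in range(len(plan)):
            best, best_r = plan[i], reward(plan)
            for option in candidates(plan[i]):
                trial = plan[:i] + [option] + plan[i + 1:]
                r = reward(trial)
                if r > best_r:  # strict improvement only
                    best, best_r = option, r
            plan[i] = best
    return plan


# Hypothetical target plan, used only to define a toy reward.
TARGET: List[Needle] = [(2, 2.0), (4, 1.5)]


def toy_reward(plan: List[Needle]) -> float:
    # Negative squared distance to the toy target (higher is better).
    return -sum((p - tp) ** 2 + (d - td) ** 2
                for (p, d), (tp, td) in zip(plan, TARGET))


def toy_candidates(needle: Needle) -> List[Needle]:
    # Shift the position by at most one step and the dwell time by 0.5.
    p, d = needle
    return [(p + dp, d + dd)
            for dp in (-1, 0, 1)
            for dd in (-0.5, 0.0, 0.5)
            if d + dd >= 0.0]


result = adjust_plan([(0, 1.0), (5, 1.0)], toy_reward, toy_candidates,
                     max_rounds=3)
print(result)
```

In the setting the abstract describes, the greedy inner loop would be replaced by the trained agent's action given its observation of the environment, and the reward would be the pre-defined clinical reward function rather than a distance to a known target.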

Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving

Authors:Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv

Date:2025-06-11 14:42:11

End-to-end autonomous driving has emerged as a promising paradigm for directly mapping sensor inputs to planning maneuvers using learning-based modular integrations. However, existing imitation learning (IL)-based models suffer from poor generalization to hard cases and lack a corrective feedback loop after deployment. While reinforcement learning (RL) offers a potential solution to tackle hard cases with optimality, it is often hindered by overfitting to specific driving cases, resulting in catastrophic forgetting of generalizable knowledge and sample inefficiency. To overcome these challenges, we propose Reinforced Refinement with Self-aware Expansion (R2SE), a novel learning pipeline that continually refines the hard-case domain while keeping a generalizable driving policy for model-agnostic end-to-end driving systems. Through reinforcement fine-tuning and policy expansion that facilitates continuous improvement, R2SE features three key components: 1) Generalist Pretraining with hard-case allocation trains a generalist imitation learning (IL) driving system while dynamically identifying failure-prone cases for targeted refinement; 2) Residual Reinforced Specialist Fine-tuning optimizes residual corrections using reinforcement learning (RL) to improve performance in the hard-case domain while preserving global driving knowledge; 3) Self-aware Adapter Expansion dynamically integrates specialist policies back into the generalist model, enhancing continuous performance improvement. Experimental results in closed-loop simulation and real-world datasets demonstrate improvements in generalization, safety, and long-horizon policy robustness over state-of-the-art E2E systems, highlighting the effectiveness of reinforced refinement for scalable autonomous driving.

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Authors:Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang

Date:2025-06-11 14:35:15

AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and high node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.