planning - 2026-01-11

Learning Latent Action World Models In The Wild

Authors:Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
Date:2026-01-08 18:55:39

Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be costly to obtain at scale. This motivates latent action models, which learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should satisfy, as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions can capture the complexity of actions in in-the-wild videos, something that the commonly used vector quantization cannot. For example, we find that changes in the environment caused by agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, the latent actions we learn are mainly localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model at performance similar to action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

Sparsity and uniform regularity for regularised optimal transport

Authors:Rishabh S. Gvalani, Lukas Koch
Date:2026-01-08 17:20:00

We consider regularised quadratic optimal transport with subquadratic polynomial or entropic regularisation. In both cases, we prove interior Lipschitz estimates on a transport-like map and interior gradient Lipschitz estimates on the potentials, under the assumption that the transport map solving the unregularised problem is bi-$C^{1,\alpha}$-regular. For strictly subquadratic and entropic regularisation, the estimates improve to interior $C^1$ and $C^2$ estimates for the transport-like map and the potentials, respectively. Our estimates are uniform in the regularisation parameter. As a consequence of this, we obtain convergence of the transport-like map (resp. the potentials) to the unregularised transport map (resp. Kantorovich potentials) in $C^{0,1-}_{\mathrm{loc}}$ (resp. $C^{1,1-}_{\mathrm{loc}}$). Central to our approach are sharp local bounds on the size of the support for regularised optimal transport which we derive for a general convex, superlinear regularisation term. These bounds are of independent interest and imply global bias bounds for the regularised transport plans. Our global bounds, while not necessarily sharp, improve on the best known results in the literature for quadratic regularisation.
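The abstract above concerns uniform-in-regularisation estimates for entropic OT. As a purely illustrative aside (not taken from the paper), the standard numerical route to the entropically regularised plan is Sinkhorn's alternating-scaling iteration; the sketch below shows the regularised cost shrinking as the regularisation parameter eps decreases, with uniform marginals and a quadratic cost as assumed toy inputs.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=500):
    """Entropic-regularised OT via Sinkhorn scaling iterations."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # alternate marginal-matching updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P

# Quadratic cost on a 1-D grid; the plan concentrates as eps shrinks.
x = np.linspace(0.0, 1.0, 50)
a = np.ones(50) / 50
b = np.ones(50) / 50
C = (x[:, None] - x[None, :]) ** 2
for eps in (1.0, 0.1, 0.01):
    P = sinkhorn(a, b, C, eps)
    print(eps, float((P * C).sum()))
```

With identical marginals the unregularised optimum is the identity map with zero cost, so the printed regularised costs decrease towards zero as eps does, a small-scale echo of the convergence statements in the abstract.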

Dosimetric Impact of Hidden Input Parameters in Inverse Optimization Algorithms for GYN HDR Brachytherapy

Authors:YeongHyeon Park, Shiqin Su, Sarath Vijayan, Zhiqian Henry Yu, Mandy Cunningham, Yusung Kim
Date:2026-01-08 15:51:24

Inverse optimization (IO) algorithms are used in GYN HDR brachytherapy planning, with user parameter settings embedded in commercial treatment planning systems (TPS). We examine the dosimetric influence of hidden input parameters in three IO algorithms (IPSA, HIPO, and MCO) for GYN HDR brachytherapy across two applicator types. In-house implementations of IPSA, HIPO, and MCO were evaluated against retrospectively generated commercial TPS plans (Oncentra Brachy) using identical clinical input parameters across 24 cervical cancer cases (18 T&O; 6 T&O+Needles (T&O+N)). Each IO algorithm was assessed using 1,000 combinations of hidden parameters (e.g., dwell-time modulation constraints, convergence thresholds). Cumulative DVH curves and dosimetric indices (HR-CTV D98/D90, OAR D2cc) were compared with commercial plans. Standard deviations (SD) of DVH differences were used to characterize sensitivity to hidden parameters. For HR-CTV, SD values in T&O+N cases reached 23.0 Gy and 7.1 Gy for MCO and HIPO, respectively, with corresponding average values of 55.8 Gy and 19.7 Gy. In T&O cases, HR-CTV SD values reached 4.9 Gy and 3.3 Gy for HIPO and IPSA, respectively, with average values of 20.1 Gy and 8.6 Gy. MCO exhibited the highest sensitivity, followed by HIPO and IPSA. T&O+N cases showed greater sensitivity than T&O cases. Absolute differences in HR-CTV D90 (D98) relative to commercial algorithms reached up to 33.3 Gy (28.4) for T&O+N cases and 10.8 Gy (8.5) for T&O cases. For OARs, absolute D2cc differences in T&O+N (T&O) cases reached up to 8.6 Gy (2.3) for rectum, 17 Gy (10.2) for bladder, 14.8 Gy (3.9) for sigmoid, and 7.0 Gy (8.1) for bowel. Hidden input parameter settings significantly impact GYN HDR plans, with target coverage differences of up to 28.4 Gy across IO algorithms for both T&O and T&O+N cases. These findings show the potential to improve plans through hidden input parameter optimization.

From Stories to Cities to Games: A Qualitative Evaluation of Behaviour Planning

Authors:Mustafa F. Abdelwahed, Joan Espasa, Alice Toniolo, Ian P. Gent
Date:2026-01-08 13:09:43

The primary objective of a diverse planning approach is to generate a set of plans that are distinct from one another. Such an approach is applied in a variety of real-world domains, including risk management, automated stream data analysis, and malware detection. More recently, a novel diverse planning paradigm, referred to as behaviour planning, has been proposed. This approach extends earlier methods by explicitly incorporating a diversity model into the planning process and supporting multiple planning categories. In this paper, we demonstrate the usefulness of behaviour planning in real-world settings by presenting three case studies. The first case study focuses on storytelling, the second addresses urban planning, and the third examines game evaluation.

Precomputing Multi-Agent Path Replanning using Temporal Flexibility: A Case Study on the Dutch Railway Network

Authors:Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt
Date:2026-01-08 12:30:36

Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not result in an efficient plan, and sometimes cannot even yield a feasible plan. On the other hand, replanning other agents may lead to a cascade of changes and delays. We show how to efficiently replan by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay an agent can take without reordering, or further delaying, other agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent, also returning the changes for the other agents, for any single-agent delay within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely used Dutch railway network. Our experiments show that FlexSIPP provides effective solutions, relevant to real-world adjustments, within a reasonable timeframe.

Bi-level Multi-criteria Optimization for Risk-informed Radiotherapy

Authors:Mara Schubert, Katrin Teichert, Zhongxing Liao, Thomas Bortfeld, Ali Ajdari
Date:2026-01-08 10:57:53

In radiation therapy (RT) treatment planning, multi-criteria optimization (MCO) supports efficient plan selection but is usually solved for population-based dosimetric criteria and ignores patient-specific biological risk, potentially compromising outcomes in high-risk patients. We propose risk-guided MCO, a one-shot method that embeds a clinical risk model into conventional MCO, enabling interactive navigation between dosimetric and biological endpoints. The proposed algorithm uses a special order relation to fuse the classical MCO sandwiching algorithm with bi-level optimization, restricting the Pareto set to plans that achieve improvement in the secondary risk objective for a user-defined, acceptable loss in primary clinical objectives. Thus, risk-guided MCO generates risk-optimized counterparts of clinical plans in a single run rather than by sequential or lexicographic planning. To assess the performance, we retrospectively analyzed 19 lung cancer patients treated with RT. The endpoint was the risk of grade 2+ radiation pneumonitis (RP), modeled using bootstrapped stepwise logistic regression with interaction terms, including baseline lung function, smoking history, and dosimetric factors. The risk-guided plans yielded a mean reduction of 8.0% in total lung V20 and 9.5% in right lung V5, translating into an average RP risk reduction of 7.7% (range=0.3%-20.1%), with small changes in target coverage (mean CTV D98 change of -1.2%) and a modest increase in heart dose (mean +1.74 Gy). This study presents the first proof-of-concept for integrating biological risk models directly within multi-criteria RT planning, enabling an interactive balance between established population-wide dose protocols and individualized outcome prediction. Our results demonstrate that risk-guided MCO can reduce the risk of RP while maintaining target coverage.

SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning

Authors:Zebin Han, Xudong Wang, Baichen Liu, Qi Lyu, Zhenduo Shang, Jiahua Dong, Lianqing Liu, Zhi Han
Date:2026-01-08 08:09:24

Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario where agents must sequentially execute multi-task navigation guided by complex, long-horizon language instructions. Current vision-and-language navigation models exhibit significant performance degradation with such multi-task instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a navigation model built on a hierarchical planning framework. SeqWalker features: i) a High-Level Planner that dynamically distills global instructions into contextually relevant sub-instructions based on the agent's current visual observations, thus reducing cognitive load; ii) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the superiority of the proposed SeqWalker.

TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

Authors:Yinuo Wang, Mining Tan, Wenxiang Jiao, Xiaoxi Li, Hao Wang, Xuanyu Zhang, Yuan Lu, Weiming Dong
Date:2026-01-08 08:08:35

Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) a single reasoning path restricts exploration of the feasible solution space; (3) simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct a spatially-aware candidate POI set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves exploration of the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
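The abstract describes a sigmoid gate that releases the soft-constraint reward only once hard constraints are satisfied. The sketch below shows one plausible way such a gate could be composed; the function names, the reward shape, and the gate steepness `k` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def gated_reward(hard_violations, soft_score, k=10.0):
    """Release soft-constraint reward only once hard violations vanish.

    hard_violations: count (or weighted sum) of violated hard constraints.
    soft_score: reward in [0, 1] for soft-constraint / preference satisfaction.
    The sigmoid gate is ~0 while any violation remains and ~1 at zero.
    NOTE: hypothetical shape, not the paper's exact objective.
    """
    gate = 1.0 / (1.0 + math.exp(k * (hard_violations - 0.5)))
    return -hard_violations + gate * soft_score

# Hard constraints dominate until satisfied; then soft score matters.
print(gated_reward(3, 0.9))   # gate ~ 0: soft score barely counts
print(gated_reward(0, 0.9))   # gate ~ 1: soft score fully counted
```

The effect is that the policy gradient first pushes toward feasibility, and only near-feasible itineraries start earning preference reward, matching the "soft constraints only after hard constraints" ordering the abstract describes.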

Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning

Authors:Enze Pan
Date:2026-01-08 08:05:42

We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts. Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under heldout-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently replicated. We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what "uncertainty reduction" objectives can and cannot guarantee under rule shifts.
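To make the "held fixed state space, shifted transition rule" construction concrete, here is a minimal sketch of elementary (radius-1, binary) cellular automata: the observation space is identical for every rule, while the latent dynamics differ. The particular rule numbers used for the train/test split and the periodic boundary are assumptions for illustration, not Tape's actual configuration.

```python
import numpy as np

def step(state, rule):
    """One update of an elementary (radius-1, binary) cellular automaton."""
    l = np.roll(state, 1)        # left neighbour (periodic boundary)
    r = np.roll(state, -1)       # right neighbour
    idx = 4 * l + 2 * state + r  # 3-bit neighbourhood index, 0..7
    table = (rule >> np.arange(8)) & 1  # Wolfram rule lookup table
    return table[idx]

rng = np.random.default_rng(0)
s0 = rng.integers(0, 2, size=32)

# Same observation space, different latent transition rule:
train_rules, test_rules = [30, 90], [110]   # an illustrative held-out-rule split
for rule in train_rules + test_rules:
    s = s0.copy()
    for _ in range(5):
        s = step(s, rule)
    print(rule, int(s.sum()))
```

An agent trained only on the `train_rules` dynamics sees trajectories whose statistics silently change under `test_rules`, which is exactly the latent rule shift the benchmark is designed to isolate.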

Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead

Authors:Oluwatosin Oseni, Shengjie Wang, Jun Zhu, Micah Corah
Date:2026-01-08 07:55:07

Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards, and outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.

Optimizing Path Planning using Deep Reinforcement Learning for UGVs in Precision Agriculture

Authors:Laukik Patade, Rohan Rane, Sandeep Pillai
Date:2026-01-08 07:28:11

This study focuses on optimizing path planning for unmanned ground vehicles (UGVs) in precision agriculture using deep reinforcement learning (DRL) techniques in continuous action spaces. The research begins with a review of traditional grid-based methods, such as A* and Dijkstra's algorithms, and discusses their limitations in dynamic agricultural environments, highlighting the need for adaptive learning strategies. The study then explores DRL approaches, including Deep Q-Networks (DQN), which demonstrate improved adaptability and performance in two-dimensional simulations. Enhancements such as Double Q-Networks and Dueling Networks are evaluated to further improve decision-making. Building on these results, the focus shifts to continuous action space models, specifically Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), which are tested in increasingly complex environments. Experiments conducted in a three-dimensional environment using ROS and Gazebo demonstrate the effectiveness of continuous DRL algorithms in navigating dynamic agricultural scenarios. Notably, the pretrained TD3 agent achieves a 95 percent success rate in dynamic environments, demonstrating the robustness of the proposed approach in handling moving obstacles while ensuring safety for both crops and the robot.

Cluster-Based Bayesian SIRD Modeling of Chickenpox Epidemiology in India

Authors:Nayana Mukherjee, Chitradipa Chakraborty
Date:2026-01-08 06:41:45

This study presents a cluster-based Bayesian SIRD model to analyze the epidemiology of chickenpox (varicella) in India, utilizing data from 1990 to 2021. We employed an age-structured approach, dividing the population into juvenile, adult, and elderly groups, to capture the disease's transmission dynamics across diverse demographic groups. The model incorporates a Holling-type incidence function, which accounts for the saturation effect of transmission at high prevalence levels, and applies Bayesian inference to estimate key epidemiological parameters, including transmission rates, recovery rates, and mortality rates. The study further explores cluster analysis to identify regional clusters within India based on the similarities in chickenpox transmission dynamics, using criteria like incidence, prevalence, and mortality rates. We perform K-means clustering to uncover three distinct epidemiological regimes, which vary in terms of outbreak potential and age-specific dynamics. The findings highlight juveniles as the primary drivers of transmission, while the elderly face a disproportionately high mortality burden. Our results underscore the importance of age-targeted interventions and suggest that regional heterogeneity should be considered in public health strategies for disease control. The model offers a transparent, reproducible framework for understanding long-term transmission dynamics and supports evidence-based planning for chickenpox control in India. The practical utility of the model is further validated through a simulation study.
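The abstract's core mechanism is an SIRD compartment model with a Holling-type (saturating) incidence term. The sketch below is a minimal deterministic Euler integration of that structure; the parameter values, the type-II saturation form beta*S*I/(1+alpha*I), and the normalised population are illustrative assumptions rather than the paper's fitted Bayesian estimates.

```python
def sird_step(S, I, R, D, beta, gamma, mu, alpha, dt=0.1):
    """One Euler step of SIRD with Holling type-II incidence beta*S*I/(1+alpha*I)."""
    new_inf = beta * S * I / (1.0 + alpha * I)  # saturating transmission
    dS = -new_inf
    dI = new_inf - gamma * I - mu * I
    dR = gamma * I
    dD = mu * I
    return S + dt * dS, I + dt * dI, R + dt * dR, D + dt * dD

# Hypothetical parameters; population normalised to 1.
S, I, R, D = 0.99, 0.01, 0.0, 0.0
for _ in range(1000):
    S, I, R, D = sird_step(S, I, R, D, beta=0.5, gamma=0.1, mu=0.01, alpha=5.0)
print(round(S + I + R + D, 6))  # → 1.0: compartments conserve the population
```

Because the incidence term saturates in I, early growth resembles classical SIR, but transmission per susceptible flattens at high prevalence, which is the behaviour the Holling-type function is chosen to capture. The Bayesian layer in the paper would place priors on beta, gamma, mu, and alpha per age group and cluster rather than fixing them as above.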

Adaptive Retrieval for Reasoning-Intensive Retrieval

Authors:Jongho Kim, Jaeyoung Kim, Seung-won Hwang, Jihyuk Kim, Yu Jin Kim, Moontae Lee
Date:2026-01-08 05:46:50

We study leveraging adaptive retrieval to ensure sufficient "bridge" documents are retrieved for reasoning-intensive retrieval. Bridge documents are those that contribute to the reasoning process yet are not directly relevant to the initial query. While existing reasoning-based reranker pipelines attempt to surface these documents in ranking, they suffer from bounded recall. Naively integrating adaptive retrieval into these pipelines often propagates planning errors. To address this, we propose REPAIR, a framework that bridges this gap by repurposing reasoning plans as dense feedback signals for adaptive retrieval. Our key distinction is enabling mid-course correction during reranking through selective adaptive retrieval, retrieving documents that support the pivotal plan. Experimental results on reasoning-intensive retrieval and complex QA tasks demonstrate that our method outperforms existing baselines by 5.6 percentage points.

Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Authors:Saad Alqithami
Date:2026-01-08 04:29:26

Advances in large language models have enabled agentic AI systems that can reason, plan, and interact with external tools to execute multi-step workflows, while public blockchains have evolved into a programmable substrate for value transfer, access control, and verifiable state transitions. Their convergence introduces a high-stakes systems challenge: designing standard, interoperable, and secure interfaces that allow agents to observe on-chain state, formulate transaction intents, and authorize execution without exposing users, protocols, or organizations to unacceptable security, governance, or economic risks. This survey systematizes the emerging landscape of agent-blockchain interoperability through a systematic literature review, identifying 317 relevant works from an initial pool of over 3000 records. We contribute a five-part taxonomy of integration patterns spanning read-only analytics, simulation and intent generation, delegated execution, autonomous signing, and multi-agent workflows; a threat model tailored to agent-driven transaction pipelines that captures risks ranging from prompt injection and policy misuse to key compromise, adversarial execution dynamics, and multi-agent collusion; and a comparative capability matrix analyzing more than 20 representative systems across 13 dimensions, including custody models, permissioning, policy enforcement, observability, and recovery. Building on the gaps revealed by this analysis, we outline a research roadmap centered on two interface abstractions: a Transaction Intent Schema for portable and unambiguous goal specification, and a Policy Decision Record for auditable, verifiable policy enforcement across execution environments. We conclude by proposing a reproducible evaluation suite and benchmarks for assessing the safety, reliability, and economic robustness of agent-mediated on-chain execution.

Data-Driven Terramechanics Approach Towards a Realistic Real-Time Simulator for Lunar Rovers

Authors:Jakob M. Kern, James M. Hurrell, Shreya Santra, Keisuke Takehana, Kentaro Uno, Kazuya Yoshida
Date:2026-01-08 03:23:31

High-fidelity simulators for the lunar surface provide a digital environment for extensive testing of rover operations and mission planning. However, current simulators focus on either visual realism or physical accuracy, which limits their capability to replicate lunar conditions comprehensively. This work addresses that gap by combining high visual fidelity with realistic terrain interaction for a realistic representation of rovers on the lunar surface. Because direct simulation of wheel-soil interactions is computationally expensive, a data-driven approach was adopted, using regression models for slip and sinkage from data collected in both full-rover and single-wheel experiments and simulations. The resulting regression-based terramechanics model accurately reproduced steady-state and dynamic slip, as well as sinkage behavior, on flat terrain and slopes up to 20 degrees, with validation against field test results. Additionally, improvements were made to enhance the realism of terrain deformation and wheel trace visualization. This method supports real-time applications that require physically plausible terrain response alongside high visual fidelity.

GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Authors:Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang
Date:2026-01-08 02:07:53

Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi-modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: \textit{Goal-Oriented Masking}, where agents prioritize task completion over reporting anomalies, and \textit{Execution-Bias Attribution}, where system defects are misidentified as agent errors. To address these, we first introduce \textbf{GUITestBench}, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose \textbf{GUITester}, a multi-agent framework that decouples navigation from verification via two modules: (i) a \textit{Planning-Execution Module (PEM)} that proactively probes for defects via embedded testing intents, and (ii) a \textit{Hierarchical Reflection Module (HRM)} that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90\% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35\%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance~\footnote{Our code is now available in~\href{https://github.com/ADaM-BJTU/GUITestBench}{https://github.com/ADaM-BJTU/GUITestBench}}.

Understanding Gaming the System by Analyzing Self-Regulated Learning in Think-Aloud Protocols

Authors:Jiayi Zhang, Conrad Borchers, Canwen Wang, Vishal Kumar, Leah Teffera, Bruce M. McLaren, Ryan S. Baker
Date:2026-01-08 01:45:56

In digital learning systems, gaming the system refers to occasions when students attempt to succeed in an educational task by systematically taking advantage of system features rather than engaging meaningfully with the content. Often viewed as a form of behavioral disengagement, gaming the system is negatively associated with short- and long-term learning outcomes. However, little research has explored this phenomenon beyond its behavioral representation, leaving questions such as whether students are cognitively disengaged or whether they engage in different self-regulated learning (SRL) strategies when gaming largely unanswered. This study employs a mixed-methods approach to examine students' cognitive engagement and SRL processes during gaming versus non-gaming periods, using utterance length and SRL codes inferred from think-aloud protocols collected while students interacted with an intelligent tutoring system for chemistry. We found that gaming does not simply reflect a lack of cognitive effort; during gaming, students often produced longer utterances, were more likely to engage in processing information and realizing errors, but less likely to engage in planning, and exhibited reactive rather than proactive self-regulatory strategies. These findings provide empirical evidence supporting the interpretation that gaming may represent a maladaptive form of SRL. With this understanding, future work can address gaming and its negative impacts by designing systems that target maladaptive self-regulation to promote better learning.

Decision-Aware Trust Signal Alignment for SOC Alert Triage

Authors:Israt Jahan Chowdhury, Md Abu Yousuf Tanvir
Date:2026-01-08 01:41:54

Machine learning detection systems are increasingly deployed in Security Operations Centers (SOCs) to help analysts filter high volumes of security alerts. In practice, such systems tend to expose probabilistic outputs or confidence scores that are poorly calibrated and hard to interpret under pressure. Prior qualitative and survey-based studies of SOC practice show that poor alert quality and alert overload greatly increase the burden on analysts, especially when tool outputs do not match decision requirements or are dominated by signal noise. One of the most significant limitations is that model confidence is usually shown without accounting for the asymmetric costs of decision making, where false alarms are much less harmful than missed attacks. This paper presents a decision-aware trust signal alignment framework for SOC alert triage. The framework combines calibrated confidence, lightweight uncertainty cues, and cost-sensitive decision thresholds into a coherent decision-support layer, instead of modifying the detection models. Calibration is performed using well-known post-hoc methods to improve probabilistic consistency, and the uncertainty cues provide conservative protection when model certainty is low. To measure the model-independent performance of the proposed approach, we apply Logistic Regression and Random Forest classifiers to the UNSW-NB15 intrusion detection benchmark. Simulation results show that misaligned confidence displays greatly amplify false negatives, whereas decision-aligned trust signals reduce cost-weighted loss by orders of magnitude across models. Finally, we describe a human-in-the-loop study plan for empirically assessing analysts' decision making under aligned versus misaligned trust interfaces.
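The core of the cost-sensitivity argument is standard decision theory: for calibrated probabilities and constant misclassification costs, the Bayes-optimal alert threshold is c_fp / (c_fp + c_fn), far below 0.5 when missed attacks are much costlier than false alarms. The sketch below illustrates this; the function names and the toy scored alerts are illustrative, not the paper's experimental setup.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Bayes-optimal threshold for a calibrated P(malicious):
    alert when p >= cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(threshold, scored_alerts, cost_fp, cost_fn):
    """Total cost of a fixed threshold over (p, is_attack) pairs."""
    cost = 0.0
    for p, is_attack in scored_alerts:
        alerted = p >= threshold
        if alerted and not is_attack:
            cost += cost_fp   # false alarm
        elif not alerted and is_attack:
            cost += cost_fn   # missed attack
    return cost

# Missed attacks cost 50x a false alarm -> alert at much lower confidence.
alerts = [(0.05, False), (0.2, True), (0.3, False), (0.7, True), (0.9, True)]
t = optimal_threshold(cost_fp=1.0, cost_fn=50.0)
print(round(t, 4))                                 # → 0.0196
print(expected_cost(0.5, alerts, 1.0, 50.0))       # → 50.0 (misses the 0.2 attack)
print(expected_cost(t, alerts, 1.0, 50.0))         # → 2.0 (two cheap false alarms)
```

The naive 0.5 cutoff misses the low-confidence true attack and pays the large false-negative cost, while the cost-aligned threshold trades two cheap false alarms for catching every attack, which is the misalignment effect the simulation results quantify.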

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors:Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
Date:2026-01-07 23:49:52

World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .

From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery

Authors:Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack Noble
Date:2026-01-07 21:23:35

Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are few deep-learning-based studies on this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work to integrate self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging a 3D T-distribution loss in weakly-supervised medical imaging.

Biomechanically Informed Image Registration for Patient-Specific Aortic Valve Strain Analysis

Authors:Mohsen Nakhaei, Alison Pouch, Silvani Amin, Matthew Daemer, Christian Herz, Natalie Yushkevich, Lourdes Al Ghofaily, Nimesh Desai, Joseph Bavaria, Matthew Jolley, Wensi Wu
Date:2026-01-07 20:31:46

Aortic valve (AV) biomechanics play a critical role in maintaining normal cardiac function. Pathological variations, particularly in bicuspid aortic valves (BAVs), alter leaflet loading, increase strain, and accelerate disease progression. Accurate, patient-specific characterization of valve geometry and deformation is essential for predicting disease progression and guiding durable repair. Current imaging and computational methods often fail to capture rapid valve motion and complex patient-specific features. To address these challenges, we combined image registration with the finite element method (FEM) to enhance AV tracking and biomechanical assessment. Patient-specific valve geometries from 4D transesophageal echocardiography (TEE) and CT were used in FEM to model AV closure and generate intermediate deformation states. The FEM-generated states facilitated leaflet tracking, while the registration algorithm corrected mismatches between simulation and images. Across 20 patients, FEM-augmented registration improved accuracy by 40% compared with direct registration (33% for TEE, 46% for CT). This improvement enabled more reliable strain estimation directly from imaging and reduced uncertainties from boundary conditions and material assumptions. Areal and Green-Lagrange strains, as well as effective strain, were quantified in adult trileaflet, adult bicuspid, and pediatric patients. Trileaflet adult valves showed uniform deformation, BAVs exhibited asymmetric strain, and pediatric valves had low mean areal strain with high variability. Convergence between trileaflet adult and pediatric valves in mean effective strain suggests volumetric deformation drives age- and size-related differences. The FEM-augmented registration framework enhances geometric tracking and provides clinically relevant insights into patient-specific AV deformation, supporting individualized intervention planning.

Phasor Agents: Oscillatory Graphs with Three-Factor Plasticity and Sleep-Staged Learning

Authors:Rodja Trappe
Date:2026-01-07 19:57:02

Phasor Agents are dynamical systems whose internal state is a Phasor Graph: a weighted graph of coupled Stuart-Landau oscillators. A Stuart-Landau oscillator is a minimal stable "rhythm generator" (the normal form near a Hopf bifurcation); each oscillator is treated as an abstract computational unit (inspired by, but not claiming to model, biological oscillatory populations). In this interpretation, oscillator phase tracks relative timing (coherence), while amplitude tracks local gain or activity. Relative phase structure serves as a representational medium; coupling weights are learned via three-factor local plasticity - eligibility traces gated by sparse global modulators and oscillation-timed write windows - without backpropagation. A central challenge in oscillatory substrates is stability: online weight updates can drive the network into unwanted regimes (e.g., global synchrony), collapsing representational diversity. We therefore separate wake tagging from offline consolidation, inspired by synaptic tagging-and-capture and sleep-stage dynamics: deep-sleep-like gated capture commits tagged changes safely, while REM-like replay reconstructs and perturbs experience for planning. A staged experiment suite validates each mechanism with ablations and falsifiers: eligibility traces preserve credit under delayed modulation; compression-progress signals pass timestamp-shuffle controls; phase-coherent retrieval reaches 4x the diffusive baseline under noise; wake/sleep separation expands stable learning by 67 percent under matched weight-norm budgets; REM replay improves maze success rate by +45.5 percentage points; and a Tolman-style latent-learning signature - immediate competence and detour advantage after unrewarded exploration, consistent with an internal model - emerges from replay (Tolman, 1948). The codebase and all artifacts are open-source.
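The Phasor Graph substrate described above can be sketched as a small network of coupled Stuart-Landau oscillators. The following is a minimal Euler integration, not the paper's implementation; the graph size, coupling strengths, and frequencies are all illustrative, and the learned plasticity rules are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = 0.05 * rng.random((n, n))    # coupling weights (random here; learned in the paper)
np.fill_diagonal(W, 0.0)
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # state: amplitude * exp(i*phase)
omega = rng.uniform(0.8, 1.2, n)                          # natural frequencies
mu, dt = 1.0, 0.01                                        # Hopf parameter, Euler step

for _ in range(5000):
    # Stuart-Landau normal form plus diffusive coupling through the graph.
    dz = (mu + 1j * omega - np.abs(z) ** 2) * z + W @ z - W.sum(axis=1) * z
    z = z + dt * dz

amplitude, phase = np.abs(z), np.angle(z)  # local gain and relative timing
```

For weak coupling, each oscillator settles on a limit cycle of amplitude near sqrt(mu), while the relative phases, shaped by W, carry the representational content.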

UNIC: Learning Unified Multimodal Extrinsic Contact Estimation

Authors:Zhengtong Xu, Yuki Shirai
Date:2026-01-07 19:43:16

Contact-rich manipulation requires reliable estimation of extrinsic contacts, the interactions between a grasped object and its environment that provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
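The random-masking fusion idea can be illustrated with a toy routine that zeroes out entire modalities during training, forcing downstream layers to cope with missing inputs. The function name, shapes, and drop probability below are assumptions for illustration, not UNIC's API.

```python
import numpy as np

def fuse_with_random_masking(feats, p_drop=0.3, train=True, rng=None):
    """Concatenate per-modality feature vectors, randomly zeroing whole
    modalities at train time (illustrative sketch of masking-based fusion).
    feats: dict mapping modality name -> 1-D feature array."""
    rng = rng or np.random.default_rng()
    names = sorted(feats)                                   # fixed modality order
    drop = [train and rng.random() < p_drop for _ in names]
    if all(drop):                                           # keep at least one modality
        drop[rng.integers(len(names))] = False
    parts = [np.zeros_like(feats[n]) if d else feats[n]
             for n, d in zip(names, drop)]
    return np.concatenate(parts)
```

At inference (`train=False`) the fusion is deterministic, and a genuinely missing modality can simply be passed as zeros, matching what the model saw during training.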

ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation

Authors:Erel Kaplan, Tomer Bitan, Lian Ghrayeb, Le Chen, Tom Yotam, Niranjan Hasabnis, Gal Oren
Date:2026-01-07 19:04:53

Parallel programming is central to HPC and AI, but producing code that is correct and fast remains challenging, especially for OpenMP GPU offload, where data movement and tuning dominate. Autonomous coding agents can compile, test, and profile on target hardware, but outputs are brittle without domain scaffolding. We present ParaCodex, an HPC-engineer workflow that turns a Codex-based agent into an autonomous OpenMP GPU offload system using staged hotspot analysis, explicit data planning, correctness gating, and profiling-guided refinement. We evaluate translation from serial CPU kernels to OpenMP GPU offload kernels on HeCBench, Rodinia, and NAS. After excluding five kernels, ParaCodex succeeded on all 31 valid kernels. The generated kernels improved GPU time over reference OpenMP implementations in 25/31 cases, achieving geometric-mean speedups of 3x on HeCBench and 5x on Rodinia, and outperforming a zero-shot Codex baseline on all suites. We also evaluate CUDA to OpenMP offload translation on ParEval, where ParaCodex maintains high compilation and validation rates in code-only and end-to-end settings.

Prediction Intervals for Interim Events in Randomized Clinical Trials with Time-to-Event Endpoints

Authors:Edoardo Ratti, Federico L. Perlino, Stefania Galimberti, Maria G. Valsecchi
Date:2026-01-07 18:58:45

Time-to-event endpoints are central to evaluating treatment efficacy across many disease areas. Many trial protocols include interim analyses within group-sequential designs that control type I error via spending functions or boundary methods. The corresponding operating characteristics depend on the number of looks and the information accrued. Planning interim analyses with time-to-event endpoints is challenging because statistical information depends on the number of observed events. Ensuring adequate follow-up to accrue the required events is therefore critical, making interim prediction of information at scheduled looks and at the final analysis essential. While several methods have been developed to predict the calendar time required to reach a target number of events, to the best of our knowledge there is no established framework that addresses the prediction of the number of events at a future date with corresponding prediction intervals. Starting from a prediction interval approach originally developed in reliability engineering for the number of future component failures, we reformulated and extended it to the context of interim monitoring in clinical trials. This adaptation yields a general framework for event-count prediction intervals in the clinical setting, taking the patient as the unit of analysis and accommodating a range of parametric survival models, patient-level covariates, staggered entry, and possible dependence between entry dates and loss to follow-up. Prediction intervals are obtained in a frequentist framework from a bootstrap estimator of the conditional distribution of future events. The performance of the proposed approach is investigated via simulation studies and illustrated by analyzing a real-world phase III trial in childhood acute lymphoblastic leukaemia.
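The bootstrap construction can be sketched in a deliberately simplified setting: an exponential survival model with no covariates or staggered entry, where patients are resampled, the hazard is refit, and future events among at-risk patients are simulated. All names and the model choice below are assumptions for illustration; the paper's framework supports a range of parametric models.

```python
import numpy as np

def event_count_pi(times, events, delta, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap prediction interval for the number of additional events
    within the next `delta` time units, under an exponential model
    (illustrative simplification of the proposed framework).
    times: follow-up times; events: 1 if the event was observed, else at risk."""
    rng = np.random.default_rng(seed)
    times, events = np.asarray(times, float), np.asarray(events, int)
    n = len(times)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)              # resample patients
        t, e = times[idx], events[idx]
        rate = max(e.sum(), 1) / t.sum()         # hazard MLE (guard against 0 events)
        at_risk = int((e == 0).sum())
        p = 1.0 - np.exp(-rate * delta)          # memoryless exponential: P(event by delta)
        draws[b] = rng.binomial(at_risk, p)      # simulate future event count
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return int(lo), int(hi)
```

The interval reflects both estimation uncertainty (through resampling) and outcome uncertainty (through the binomial simulation), which is what distinguishes a prediction interval from a confidence interval for the expected count.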

Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Authors:Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
Date:2026-01-07 17:50:37

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow-wo-val). Building upon 609 robot manipulation sequences, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.

Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

Authors:Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
Date:2026-01-07 16:21:19

The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose \textbf{CoM-DAD} (\textbf{Co}upled \textbf{M}anifold \textbf{D}iscrete \textbf{A}bsorbing \textbf{D}iffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a \textbf{Variable-Rate Noise Schedule}, conditioned on these evolving semantic priors. Crucially, we introduce a \textbf{Stochastic Mixed-Modal Transport} strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
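The discrete absorbing diffusion at the heart of the token-synthesis stage can be illustrated with a toy forward process: each token independently falls into an absorbing MASK state with a probability set by a time-dependent schedule, and a variable-rate schedule simply varies how fast that happens. The function names and schedule below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

MASK = -1  # absorbing state (stand-in for a [MASK] token id)

def absorb(tokens, t, schedule, rng=None):
    """Forward discrete absorbing diffusion: each token independently
    transitions to MASK with probability schedule(t) in [0, 1].
    Illustrative sketch; a variable-rate schedule changes speed over t."""
    rng = rng or np.random.default_rng()
    p = schedule(t)                        # absorption probability at time t
    hit = rng.random(tokens.shape) < p     # which positions get absorbed
    return np.where(hit, MASK, tokens)

# Example cosine schedule: absorbs slowly early, quickly late.
cosine = lambda t: 1.0 - np.cos(0.5 * np.pi * t)
```

Generation runs this process in reverse: starting from all-MASK sequences, the model iteratively predicts and un-masks tokens, here conditioned on the continuous semantic latent produced by the first stage.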

Stage-specific cancer survival prediction enriched by explainable machine learning

Authors:Parisa Poorhasani, Bogdan Iancu
Date:2026-01-07 14:44:04

Although cancer survivability rates vary greatly between stages, traditional survival prediction models have frequently been trained and assessed on examples from all disease stages combined. This approach may overestimate performance and ignore stage-specific variation. Using the SEER dataset, we created and verified explainable machine learning (ML) models to predict stage-specific cancer survivability in colorectal, stomach, and liver cancers. ML-based cancer survival analysis has been a long-standing topic in the literature; however, studies involving the explainability and transparency of ML survivability models are limited. Our use of explainability techniques, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), enabled us to illustrate significant feature-cancer stage interactions that would have remained hidden in traditional black-box models. We identified how certain demographic and clinical variables influenced survival differently across cancer stages and types. These insights offer both transparency and clinical relevance. By focusing on stage-specific models, this study identifies the most important factors at each stage of cancer, supporting personalized treatment planning.
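The core design choice, fitting one model per stage rather than pooling all stages, can be sketched with a minimal logistic regression as a stand-in for the paper's ML models; SHAP or LIME would then be applied to each fitted model separately. All function names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=500):
    """Minimal logistic regression via gradient descent (illustrative
    stand-in for the paper's survivability models)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted survival probability
        g = p - y                               # gradient of log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def fit_stage_specific(X, y, stage):
    """Train one survivability model per cancer stage instead of pooling,
    so feature effects can differ by stage."""
    return {s: fit_logreg(X[stage == s], y[stage == s])
            for s in np.unique(stage)}
```

With separate models, an explainer run per stage yields stage-specific feature attributions, which is exactly the interaction structure a single pooled model tends to average away.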

A Future Capabilities Agent for Tactical Air Traffic Control

Authors:Paul Kent, George De Ath, Martin Layton, Allen Hart, Richard Everson, Ben Carvell
Date:2026-01-07 14:19:46

Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment. Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard's behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.
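The depth-limited backtracking search over discrete lane and level choices can be illustrated with a toy constraint-satisfaction sketch: manoeuvres are committed one aircraft at a time, and a partial plan is kept only if it stays conflict-free in every sampled uncertainty scenario. The conflict check, scenario encoding, and all names below are assumptions; the real agent's digital-twin validation, causal attribution, and plan splicing are omitted.

```python
def conflict_free(assignment, scenario):
    """Hypothetical safety check: no two aircraft occupy the same
    (lane, level) slot after a scenario perturbation, e.g. a wind-shifted
    level (stand-in for validation against the stochastic digital twin)."""
    occupied = set()
    for ac, (lane, level) in assignment.items():
        slot = (lane, level + scenario.get(ac, 0))
        if slot in occupied:
            return False
        occupied.add(slot)
    return True

def backtrack(aircraft, options, scenarios, assignment=None, depth=0, max_depth=10):
    """Depth-limited backtracking: commit a manoeuvre only if the partial
    plan remains safe in every sampled uncertainty scenario."""
    assignment = assignment or {}
    if len(assignment) == len(aircraft):
        return assignment                      # complete safe plan found
    if depth >= max_depth:
        return None                            # depth limit reached
    ac = aircraft[len(assignment)]
    for opt in options[ac]:
        trial = {**assignment, ac: opt}
        if all(conflict_free(trial, s) for s in scenarios):
            result = backtrack(aircraft, options, scenarios, trial,
                               depth + 1, max_depth)
            if result:
                return result                  # propagate the complete plan
    return None                                # backtrack: no option worked
```

Validating every candidate against all scenarios before recursing is what makes the returned plan robust by construction, at the cost of a larger search, which is why the real system prunes with causal attribution and monotonic axis constraints.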

CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM

Authors:Kangjie Zhou, Zhejia Wen, Zhiyong Zhuo, Zike Yan, Pengying Wu, Ieng Hou U, Shuaiyang Li, Han Gao, Kang Ding, Wenhan Cao, Wei Pan, Chang Liu
Date:2026-01-07 14:10:46

Recent Vision-Language Models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance, failing to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation via Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning and robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordance and concrete constraint parameters into the input context and grounds them into a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effects of object removal on navigation connectivity, thereby determining interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, specifically introducing traversability-oriented strategies to manipulate diverse objects for path clearance. A systematic benchmark in Isaac Sim is proposed to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS significantly outperforms representative baselines, achieving a 17\% higher overall success rate and over 80\% improvement in complex long-horizon scenarios compared to the best-performing baseline.