planning - 2025-12-15

The Right Kind of Help: Evaluating the Effectiveness of Intervention Methods in Elementary-Level Visual Programming

Authors:Ahana Ghosh, Liina Malva, Alkis Gotovos, Danial Hooshyar, Adish Singla
Date:2025-12-12 17:22:06

Prior work has explored various intervention methods for elementary programming. However, the relative impact of these methods during the learning and post-learning phases remains unclear. In this work, we present a large-scale study comparing the effectiveness of various intervention methods in elementary programming both during learning and on novel tasks post-learning. Specifically, we compare three intervention methods: code-edit recommendations (Code-Rec), quizzes based on code edits (Code-Quiz), and quizzes based on metacognitive strategies (Plan-Quiz), along with a no-intervention control (group None). A total of 398 students (across grades 4-7) participated in a two-phase study: learning phase comprising write-code tasks from the Hour of Code: Maze Challenge with the intervention, followed by a post-learning phase comprising more advanced write-code tasks without any intervention. All intervention methods significantly improved learning performance over the control group while preserving students' problem-solving skills in the post-learning phase. Quiz-based methods further improved performance on novel post-learning tasks. Students in intervention groups also reported greater engagement and perceived skill growth.

Referring Change Detection in Remote Sensing Imagery

Authors:Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel
Date:2025-12-12 16:57:12

Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) \textbf{RCDNet}, a cross-modal fusion network designed for referring change detection, and (II) \textbf{RCDGen}, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.

Two-dimensional Decompositions of High-dimensional Configurations for Efficient Multi-vehicle Coordination at Intelligent Intersections

Authors:Amirreza Akbari, Johan Thunberg
Date:2025-12-12 16:50:17

For multi-vehicle complex traffic scenarios in shared spaces such as intelligent intersections, safe coordination and trajectory planning is challenging due to computational complexity. To meet this challenge, we introduce a computationally efficient method for generating collision-free trajectories along predefined vehicle paths. We reformulate a constrained minimum-time trajectory planning problem as a problem in a high-dimensional configuration space, where conflict zones are modeled by high-dimensional polyhedra constructed from two-dimensional rectangles. Still, in such a formulation, as the number of vehicles involved increases, the computational complexity increases significantly. To address this, we propose two algorithms for near-optimal local optimization that significantly reduce the computational complexity by decomposing the high-dimensional problem into a sequence of 2D graph search problems. The resulting trajectories are then incorporated into a Nonlinear Model Predictive Control (NMPC) framework to ensure safe and smooth vehicle motion. We furthermore show in numerical evaluation that this approach significantly outperforms existing MILP-based time-scheduling; both in terms of objective-value and computational time.

MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Authors:Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl
Date:2025-12-12 16:01:48

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

On Intensity of Preference Rank Reversal in the AHP

Authors:Jiri Mazurek, Luis Ángel Calvo
Date:2025-12-12 15:01:16

The analytic hierarchy process (AHP) is one of the most widely used multicriteria decision-making methods, with applications from agriculture to space engineering. Despite its popularity, AHP has been repeatedly criticised for rank reversal, a phenomenon in which the ranking of alternatives changes after the addition or removal of an irrelevant or duplicate alternative. This paper introduces a new type of rank reversal in AHP, arising when the intensity of preferences is uniformly increased. We show that even when all pairwise preferences preserve their direction and are intensified identically, the eigenvector method may produce a different ordering of alternatives. In contrast, the geometric mean (GM) method is robust to this intensity-of-preference (IOP) rank reversal. The applicability of this result is shown through a real decision-making problem taken from a NASA manual concerning capability prioritisation for the planned lunar Gateway orbital station.

Architecting Large Action Models for Human-in-the-Loop Intelligent Robots

Authors:Kanisorn Sangchai, Methasit Boonpun, Withawin Kraipetchara, Paulo Garcia
Date:2025-12-12 14:58:27

The realization of intelligent robots, operating autonomously and interacting with other intelligent agents, human or artificial, requires the integration of environment perception, reasoning, and action. Classic Artificial Intelligence techniques for this purpose, focusing on symbolic approaches, have long-ago hit the scalability wall on compute and memory costs. Advances in Large Language Models in the past decade (neural approaches) have resulted in unprecedented displays of capability, at the cost of control, explainability, and interpretability. Large Action Models aim at extending Large Language Models to encompass the full perception, reasoning, and action cycle; however, they typically require substantially more comprehensive training and suffer from the same deficiencies in reliability. Here, we show it is possible to build competent Large Action Models by composing off-the-shelf foundation models, and that their control, interpretability, and explainability can be effected by incorporating symbolic wrappers and associated verification on their outputs, achieving verifiable neuro-symbolic solutions for intelligent robots. Our experiments on a multi-modal robot demonstrate that Large Action Model intelligence does not require massive end-to-end training, but can be achieved by integrating efficient perception models with a logic-driven core. We find that driving action execution through the generation of Planning Domain Definition Language (PDDL) code enables a human-in-the-loop verification stage that effectively mitigates action hallucinations. These results can support practitioners in the design and development of robotic Large Action Models across novel industries, and shed light on the ongoing challenges that must be addressed to ensure safety in the field.

Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents

Authors:Stefan Tabakov, Asen Popov, Dimitar Dimitrov, S. Ensiye Kiyamousavi, Vladimir Hristov, Boris Kraychev
Date:2025-12-12 14:14:27

Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace(https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets)

Cross-Entropy Optimization of Physically Grounded Task and Motion Plans

Authors:Andreu Matoses Gimenez, Nils Wilde, Chris Pek, Javier Alonso-Mora
Date:2025-12-12 13:59:23

Autonomously performing tasks often requires robots to plan high-level discrete actions and continuous low-level motions to realize them. Previous TAMP algorithms have focused mainly on computational performance, completeness, or optimality by making the problem tractable through simplifications and abstractions. However, this comes at the cost of the resulting plans potentially failing to account for the dynamics or complex contacts necessary to reliably perform the task when object manipulation is required. Additionally, approaches that ignore effects of the low-level controllers may not obtain optimal or feasible plan realizations for the real system. We investigate the use of a GPU-parallelized physics simulator to compute realizations of plans with motion controllers, explicitly accounting for dynamics, and considering contacts with the environment. Using cross-entropy optimization, we sample the parameters of the controllers, or actions, to obtain low-cost solutions. Since our approach uses the same controllers as the real system, the robot can directly execute the computed plans. We demonstrate our approach for a set of tasks where the robot is able to exploit the environment's geometry to move an object. Website and code: https://andreumatoses.github.io/research/parallel-realization

CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Authors:Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian
Date:2025-12-12 11:29:59

A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.

RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training

Authors:Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, Wei Wang
Date:2025-12-12 06:03:56

Rollout-training disaggregation is emerging as the standard architecture for Reinforcement Learning (RL) post-training, where memory-bound rollout and compute-bound training are physically disaggregated onto purpose-built clusters to maximize hardware efficiency. However, the strict synchronization required by on-policy algorithms introduces severe dependency bubbles, forcing one cluster to idle while the dependent phase is running on the other. We present RollMux, a cluster scheduling framework that reclaims these bubbles through cross-cluster orchestration. RollMux is built on the insight that the structural idleness of one job can be effectively utilized by the active phase of another. To realize this, we introduce the co-execution group abstraction, which partitions the cluster into isolated locality domains. This abstraction enables a two-tier scheduling architecture: an inter-group scheduler that optimizes job placement using conservative stochastic planning, and an intra-group scheduler that orchestrates a provably optimal round-robin schedule. The group abstraction also imposes a residency constraint, ensuring that massive model states remain cached in host memory to enable "warm-star" context switching. We evaluate RollMux on a production-scale testbed with 328 H20 and 328 H800 GPUs. RollMux improves cost efficiency by 1.84x over standard disaggregation and 1.38x over state-of-the-art co-located baselines, all while achieving 100% SLO attainment.

Words to Describe What I'm Feeling: Exploring the Potential of AI Agents for High Subjectivity Decisions in Advance Care Planning

Authors:Kellie Yu Hui Sim, Pin Sym Foong, Chenyu Zhao, Melanie Yi Ning Quek, Swarangi Subodh Mehta, Kenny Tsu Wei Choo
Date:2025-12-12 04:39:34

Serious illness can deprive patients of the capacity to speak for themselves. As populations age and caregiver networks shrink, the need for reliable support in Advance Care Planning (ACP) grows. To probe this fraught design space of using proxy agents for high-risk, high-subjectivity decisions, we built an experience prototype (\acpagent{}) and asked 15 participants in 4 workshops to train it to be their personal proxy in ACP decisions. We analysed their coping strategies and feature requests and mapped the results onto axes of agent autonomy and human control. Our findings argue for a potential new role of AI in ACP where agents act as personal advocates for individuals, building mutual intelligibility over time. We conclude with design recommendations to balance the risks and benefits of such an agent.

RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing

Authors:Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li
Date:2025-12-12 02:33:09

Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model

Authors:Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, Zhen Li
Date:2025-12-12 02:12:49

In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Authors:Dongwon Jung, Peng Shi, Yi Zhang
Date:2025-12-12 01:43:48

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

Asteroid Occultations with VERITAS: Observations and Considerations

Authors:Joshua Bartkoske, Anne Duerr, Dave Kieda
Date:2025-12-11 22:08:52

Occultations, the covering up of one celestial body by another celestial body, have been used in astronomy for millennia to learn about the sun and moon. Since 2018, VERITAS has implemented a program to detect predicted asteroid occultations, where an asteroid covers up a star. VERITAS has attempted to observe over 100 occultations to date and successfully observed 20 occultations. With these occultations, VERITAS can directly measure the smallest angular diameters of any instrument or technique in the optical for stars between magnitude 9 and 13. Each angular diameter is measured by fitting the diffraction pattern observed by the central VERITAS pixel at the start and end of an occultation. Once a planned FADC upgrade is complete, VERITAS will begin a program to search for serendipitous occultations within its full field of view (3 deg). Serendipitous occultations of sub-km trans-Neptunian objects (TNOs) have the potential to constrain models of solar system formation. This work details how VERITAS predicts and observes occultations as well as the overall status of the asteroid occultation program and future steps for observing occultations of both asteroids and TNOs.

In-Context Multi-Objective Optimization

Authors:Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski
Date:2025-12-11 20:56:42

Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

Your plan may succeed, but what about failure? Investigating how people use ChatGPT for long-term life task planning

Authors:Ben Wang, Jiqun Liu
Date:2025-12-11 20:12:48

Long-term life task planning is inherently complex and uncertain, yet little is known about how emerging AI systems support this process. This study investigates how people use ChatGPT for such planning tasks, focusing on user practices, uncertainties, and perceptions of AI assistance. We conducted an interview study with 14 participants who engaged in long-term planning activities using ChatGPT, combining analysis of their prompts and interview responses. The task topics across diverse domains, including personal well-being, event planning, and professional learning, along with prompts to initiate, refine, and contextualize plans. ChatGPT helped structure complex goals into manageable steps, generate ideas, and sustain motivation, serving as a reflective partner. Yet its outputs were often generic or idealized, lacking personalization, contextual realism, and adaptability, requiring users to actively adapt and verify results. Participants expressed a need for AI systems that provide adaptive and trustworthy guidance while acknowledging uncertainty and potential failure in long-term planning. Our findings show how AI supports long-term life task planning under evolving uncertainty and highlight design implications for systems that are adaptive, uncertainty-aware, and capable of supporting long-term planning as an evolving human-AI collaboration.

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

Authors:Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu
Date:2025-12-11 18:59:46

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.

A vision for ground-based astronomy beyond the 2030s: How to build ESO's next big telescope sustainably

Authors:Laurane Fréour, Mathilde Bouvier, Tony Mroczkowski, Callie Clontz, Fatemeh Zahra Majidi, Vasundhara Shaw, Olivier Absil, Anna Cabré, Olivier Lai, Dylan Magill, Jake D. Turner
Date:2025-12-11 18:31:04

Astronomy is the study of the Universe and all the objects that it comprises. Our attention is therefore usually focused beyond Earth, home to the only form of life known today. However, how can we continue to explore the secrets of the Universe, if we stand by and watch our only home burn? We know that there is no Planet B. It is therefore urgent that, as astronomers, we collectively work to protect the Earth, allowing future generations the opportunity to continue to uncover the secrets of the cosmos. As astronomical facilities account for the majority of our community's carbon footprint, we propose guidelines that we hold crucial for the European Southern Observatory (ESO) to consider in the context of the Expanding Horizons programme as it plans a next-generation, transformational facility.

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Authors:Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang
Date:2025-12-11 17:57:24

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Authors:Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell
Date:2025-12-11 14:59:07

End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.

COMPARE: Clinical Optimization with Modular Planning and Assessment via RAG-Enhanced AI-OCT: Superior Decision Support for Percutaneous Coronary Intervention Compared to ChatGPT-5 and Junior Operators

Authors:Wei Fang, Chiyao Wang, Wenshuai Ma, Hui Liu, Jianqiang Hu, Xiaona Niu, Yi Chu, Mingming Zhang, Jingxiao Yang, Dongwei Zhang, Zelin Li, Pengyun Liu, Jiawei Zheng, Pengke Zhang, Chaoshi Qin, Wangang Guo, Bin Wang, Yugang Xue, Wei Zhang, Zikuan Wang, Rui Zhu, Yihui Cao, Quanmao Lu, Rui Meng, Yan Li
Date:2025-12-11 14:41:37

Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the general-purpose ChatGPT-5 and junior physicians for OCT-guided PCI planning and assessment. Methods: In this single-center analysis of 96 patients who underwent OCT-guided PCI, the procedural decisions generated by the CA-GPT, ChatGPT-5, and junior physicians were compared with an expert-derived procedural record. Agreement was assessed using ten pre-specified metrics across pre-PCI and post-PCI phases. Results: For pre-PCI planning, CA-GPT demonstrated significantly higher median agreement scores (5[IQR 3.75-5]) compared to both ChatGPT-5 (3[2-4], P<0.001) and junior physicians (4[3-4], P<0.001). CA-GPT significantly outperformed ChatGPT-5 across all individual pre-PCI metrics and showed superior performance to junior physicians in stent diameter (90.3% vs. 72.2%, P<0.05) and length selection (80.6% vs. 52.8%, P<0.01). In post-PCI assessment, CA-GPT maintained excellent overall agreement (5[4.75-5]), significantly higher than both ChatGPT-5 (4[4-5], P<0.001) and junior physicians (5[4-5], P<0.05). Subgroup analysis confirmed CA-GPT's robust performance advantage in complex scenarios. Conclusion: The CA-GPT-based AI-OCT system achieved superior decision-making agreement versus a general-purpose large language model and junior physicians across both PCI planning and assessment phases. This approach provides a standardized and reliable method for intravascular imaging interpretation, demonstrating significant potential to augment operator expertise and optimize OCT-guided PCI.

nDspec: a new Python library for modelling multi-dimensional datasets in X-ray astronomy

Authors:Matteo Lucchini, Benjamin Ricketts, Phil Uttley, Daniela Huppenkothen
Date:2025-12-11 13:12:18

The current fleet of X-ray telescopes produces a wealth of multi-dimensional data, allowing us to study sources in time, photon energy and polarization. At the same time, it has become increasingly clear that progress in our physical understanding will only come from studying these sources in multiple dimensions simultaneously. Enabling multi-dimensional studies of X-ray sources requires new theoretical models predicting these data sets, new methods to analyse them and a software framework to combine data, models and methods efficiently. In this paper, we introduce the alpha release of nDspec, a new python-based library designed to allow users to model one- and multi-dimensional datasets common to X-ray astronomy. In the alpha release, we focus on modelling time-averaged data as well as Fourier spectral-timing mode, but highlight how additional dimensions can be added. We discuss design philosophy and current features, and showcase an example use case by characterizing a NICER observation of a black hole X-ray binary. We also highlight current plans for extensions to other dimensions and new features.

Motion Planning for Safe Landing of a Human-Piloted Parafoil

Authors:Maximillian Fainkich, Kiril Solovey, Anna Clarke
Date:2025-12-11 12:39:48

Most skydiving accidents occur during the parafoil-piloting and landing stages and result from human lapses in judgment while piloting the parafoil. Training of novice pilots is protracted due to the lack of functional and easily accessible training simulators. Moreover, work on parafoil trajectory planning suitable for aiding human training remains limited. To bridge this gap, we study the problem of computing safe trajectories for human-piloted parafoil flight and examine how such trajectories fare against human-generated solutions. For the algorithmic part, we adapt the sampling-based motion planner Stable Sparse RRT (SST) by Li et al., to cope with the problem constraints while minimizing the bank angle (control effort) as a proxy for safety. We then compare the computer-generated solutions with data from human-generated parafoil flight, where the algorithm offers a relative cost improvement of 20\%-80\% over the performance of the human pilot. We observe that human pilots tend to, first, close the horizontal distance to the landing area, and then address the vertical gap by spiraling down to the suitable altitude for starting a landing maneuver. The algorithm considered here makes smoother and more gradual descents, arriving at the landing area at the precise altitude necessary for the final approach while maintaining safety constraints. Overall, the study demonstrates the potential of computer-generated guidelines, rather than traditional rules of thumb, which can be integrated into future simulators to train pilots for safer and more cost-effective flights.

Reconstructing Transportation Cost Planning Theory: A Multi-Layered Framework Integrating Stepwise Functions, AI-Driven Dynamic Pricing, and Sustainable Autonomy

Authors:Samuel Darwisman
Date:2025-12-11 10:15:00

The theoretical landscape of transportation cost planning is shifting from deterministic linear models to dynamic, data-driven optimization. As supply chains face volatility, static 20th-century cost assumptions prove increasingly inadequate. Despite rapid technological advancements, a unified framework linking economic production theory with the operational realities of autonomous, sustainable logistics remains absent. Existing models fail to address non-linear stepwise costs and real-time stochastic variables introduced by market dynamics. This study reconstructs transportation cost planning theory by synthesizing Grand, Middle-Range, and Applied theories. It aims to integrate stepwise cost functions, AI-driven decision-making, and environmental externalities into a cohesive planning model. A systematic theoretical synthesis was conducted using 28 high-impact papers published primarily between 2018 and 2025, employing multi-layered analysis to reconstruct cost drivers. The study identifies three critical shifts: the transition from linear to stepwise fixed costs, the necessity of AI-driven dynamic pricing for revenue optimization, and the role of Autonomous Electric Vehicles (AEVs) in minimizing long-term marginal costs. A "Dynamic-Sustainable Cost Planning Theory" is proposed, arguing that cost efficiency now depends on algorithmic prediction and autonomous fleet utilization rather than simple distance minimization.

T-SKM-Net: Trainable Neural Network Framework for Linear Constraint Satisfaction via Sampling Kaczmarz-Motzkin Method

Authors:Haoyu Zhu, Yao Zhang, Jiashen Ren, Qingchun Hou
Date:2025-12-11 09:35:13

Neural network constraint satisfaction is crucial for safety-critical applications such as power system optimization, robotic path planning, and autonomous driving. However, existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from either high computational complexity or restrictive assumptions on constraint structures. The Sampling Kaczmarz-Motzkin (SKM) method is a randomized iterative algorithm for solving large-scale linear inequality systems with favorable convergence properties, but its argmax operations introduce non-differentiability, posing challenges for neural network applications. This work proposes the Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework and, for the first time, systematically integrates SKM-type methods into neural network constraint satisfaction. The framework transforms mixed constraint problems into pure inequality problems through null space transformation, employs SKM for iterative solving, and maps solutions back to the original constraint space, efficiently handling both equality and inequality constraints. We provide theoretical proof of post-processing effectiveness in expectation and end-to-end trainability guarantees based on unbiased gradient estimators, demonstrating that despite non-differentiable operations, the framework supports standard backpropagation. On the DCOPF case118 benchmark, our method achieves 4.27ms/item GPU serial forward inference with 0.0025% max optimality gap with post-processing mode and 5.25ms/item with 0.0008% max optimality gap with joint training mode, delivering over 25$\times$ speedup compared to the pandapower solver while maintaining zero constraint violations under given tolerance.

Robust Crop Planning under Uncertainty: Aligning Economic Optimality with Agronomic Sustainability

Authors:Runhao Liu, Ziming Chen, You Li, Peng Zhang
Date:2025-12-11 08:04:57

Long-horizon agricultural planning requires optimizing crop allocation under complex spatial heterogeneity, temporal agronomic dependencies, and multi-source environmental uncertainty. Existing approaches often treat crop interactions, such as legume-cereal complementarity, which implicitly or rely on static deterministic formulations that fail to guarantee resilience against market and climate volatility. To address these challenges, we propose a Multi-Layer Robust Crop Planning Framework (MLRCPF) that integrates spatial reasoning, temporal dynamics, and robust optimization. Specifically, we formalize crop-to-crop relationships through a structured interaction matrix embedded within the state-transition logic, and employ a distributionally robust optimization layer to mitigate worst-case risks defined by a data-driven ambiguity set. Evaluations on a real-world high-mix farming dataset from North China demonstrate the effectiveness of the proposed approach. The framework autonomously generates sustainable checkerboard rotation patterns that restore soil fertility, significantly increasing the legume planting ratio compared to deterministic baselines. Economically, it successfully resolves the trade-off between optimality and stability. These results highlight the importance of explicitly encoding domain-specific structural priors into optimization models for resilient decision-making in complex agricultural systems.

Integrated Planning and Machine-Level Scheduling for High-Mix Discrete Manufacturing: A Profit-Driven Heuristic Framework

Authors:Runhao Liu, Ziming Chen, You Li, Zequn Xie, Peng Zhang
Date:2025-12-11 07:17:29

Modern manufacturing enterprises struggle to create efficient and reliable production schedules under multi-variety, small-batch, and rush-order conditions. High-mix discrete manufacturing systems require jointly optimizing mid-term production planning and machine-level scheduling under heterogeneous resources and stringent delivery commitments. We address this problem with a profit-driven integrated framework that couples a mixed-integer planning model with a machine-level scheduling heuristic. The planning layer allocates production, accessory co-production, and outsourcing under aggregate economic and capacity constraints, while the scheduling layer refines these allocations using a structure-aware procedure that enforces execution feasibility and stabilizes daily machine behavior. This hierarchical design preserves the tractability of aggregated optimization while capturing detailed operational restrictions. Evaluations are conducted on a real industrial scenario. A flexible machine-level execution scheme yields 73.3% on-time completion and significant outsourcing demand, revealing bottleneck congestion. In contrast, a stability-enforcing execution policy achieves 100% on-time completion, eliminates all outsourcing, and maintains balanced machine utilization with only 1.9 to 4.6% capacity loss from changeovers. These results show that aligning planning decisions with stability-oriented execution rules enables practical and interpretable profit-maximizing decisions in complex manufacturing environments.

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Authors:Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat
Date:2025-12-11 06:46:51

Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction,and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g. Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

Authors:Sarwan Ali, Taslim Murad
Date:2025-12-10 23:03:10

Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4\% classification accuracy while reducing embedding generation time by as much as 99.81\%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.