planning - 2025-12-03

PPTArena: A Benchmark for Agentic PowerPoint Editing

Authors:Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang

Date:2025-12-02 18:59:50

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Authors:Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan

Date:2025-12-02 18:59:44

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

Experimental Characterization of Fingertip Trajectory following for a 3-DoF Series-Parallel Hybrid Robotic Finger

Authors:Nicholas Baiata, Nilanjan Chakraborty

Date:2025-12-02 17:23:44

Task-space control of robotic fingers is a critical enabler of dexterous manipulation, as manipulation objectives are most naturally specified in terms of fingertip motions and applied forces rather than individual joint angles. While task-space planning and control have been extensively studied for larger, arm-scale manipulators, demonstrations of precise task-space trajectory tracking in compact, multi-DoF robotic fingers remain scarce. In this paper, we present the physical prototyping and experimental characterization of a three-degree-of-freedom, linkage-driven, series-parallel robotic finger with analytic forward kinematics and a closed-form Jacobian. A resolved motion rate control (RMRC) scheme is implemented to achieve closed-loop task-space trajectory tracking. We experimentally evaluate the fingertip tracking performance across a variety of trajectories, including straight lines, circles, and more complex curves, and report millimeter-level accuracy. To the best of our knowledge, this work provides one of the first systematic experimental demonstrations of precise task-space trajectory tracking in a linkage-driven robotic finger, thereby establishing a benchmark for future designs aimed at dexterous in-hand manipulation.
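
As an illustration of the resolved motion rate control idea mentioned in the abstract above, here is a minimal sketch. It assumes a planar two-link serial chain, not the paper's 3-DoF series-parallel linkage. The link lengths, gain, time step, and all function names (`fk`, `jacobian`, `rmrc_step`, `track_line`) are hypothetical. The update maps a desired fingertip velocity plus a proportional position correction into joint rates through the inverse of a closed-form Jacobian.

```python
import math

# Hypothetical link lengths in metres; not taken from the paper.
L1, L2 = 0.05, 0.04


def fk(q):
    """Fingertip position of a planar two-link chain for joint angles q."""
    a, b = q
    return (L1 * math.cos(a) + L2 * math.cos(a + b),
            L1 * math.sin(a) + L2 * math.sin(a + b))


def jacobian(q):
    """Closed-form 2x2 Jacobian d(x, y)/d(q0, q1) of fk."""
    a, b = q
    s1, c1 = math.sin(a), math.cos(a)
    s12, c12 = math.sin(a + b), math.cos(a + b)
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            (L1 * c1 + L2 * c12, L2 * c12))


def rmrc_step(q, x_des, v_des, gain, dt):
    """One RMRC update: q_dot = J^-1 (v_des + gain * (x_des - x))."""
    x = fk(q)
    vx = v_des[0] + gain * (x_des[0] - x[0])
    vy = v_des[1] + gain * (x_des[1] - x[1])
    (j00, j01), (j10, j11) = jacobian(q)
    det = j00 * j11 - j01 * j10
    if abs(det) < 1e-9:
        raise ValueError("Jacobian is near-singular")
    # Explicit inverse of the 2x2 Jacobian applied to the commanded velocity.
    dq0 = (j11 * vx - j01 * vy) / det
    dq1 = (-j10 * vx + j00 * vy) / det
    return (q[0] + dq0 * dt, q[1] + dq1 * dt)


def track_line(q0, goal, duration=1.0, dt=0.01, gain=20.0):
    """Track a straight fingertip line from fk(q0) to goal; return final q."""
    start = fk(q0)
    v_des = ((goal[0] - start[0]) / duration, (goal[1] - start[1]) / duration)
    q = q0
    steps = round(duration / dt)
    for k in range(steps):
        t = k * dt
        x_des = (start[0] + v_des[0] * t, start[1] + v_des[1] * t)
        q = rmrc_step(q, x_des, v_des, gain, dt)
    return q
```

Calling `track_line((0.3, 1.0), (0.06, 0.04))` moves the simulated fingertip along a straight line to the goal point. A real finger would replace `fk` and `jacobian` with its own kinematics and read joint positions from encoders instead of integrating them.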

The future of AI in critical mineral exploration

Authors:Jef Caers

Date:2025-12-02 15:37:48

The energy transition through increased electrification has put the world's attention on critical mineral exploration. Even with increased investments, a decrease in new discoveries has taken place over the last two decades. Here I propose a solution to this problem where AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives and drive down the cost of exploration. I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification. In this approach, data acquisition is in the first place seen as a means to falsify human-generated hypotheses. The decision of what data to acquire next is quantified with verifiable metrics and based on rational decision-making. A practical protocol is provided that can be used as a template in any exploration campaign. However, in order to make this protocol practical, various forms of artificial intelligence are needed. I will argue that the most important forms are (1) novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses, and (2) human-in-the-loop AI algorithms that can optimally plan various geological, geophysical, geochemical, and drilling data acquisition, where uncertainty reduction of geological hypotheses precedes the uncertainty reduction on grade and tonnage.

Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control

Authors:Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang

Date:2025-12-02 14:25:05

Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, think with image paradigm directed region analysis planning, strategic template selection for report generation, quality assessment and feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.

CogDrive: Cognition-Driven Multimodal Prediction-Planning Fusion for Safe Autonomy

Authors:Heye Huang, Yibin Yang, Mingfeng Fan, Haoran Wang, Xiaocong Zhao, Jianqiang Wang

Date:2025-12-02 13:53:18

Safe autonomous driving in mixed traffic requires a unified understanding of multimodal interactions and dynamic planning under uncertainty. Existing learning-based approaches struggle to capture rare but safety-critical behaviors, while rule-based systems often lack adaptability in complex interactions. To address these limitations, CogDrive introduces a cognition-driven multimodal prediction and planning framework that integrates explicit modal reasoning with safety-aware trajectory optimization. The prediction module adopts cognitive representations of interaction modes based on topological motion semantics and nearest-neighbor relational encoding. With a differentiable modal loss and multimodal Gaussian decoding, CogDrive learns sparse and unbalanced interaction behaviors and improves long-horizon trajectory prediction. The planning module incorporates an emergency response concept and optimizes safety-stabilized trajectories, where short-term consistent branches ensure safety during replanning cycles and long-term branches support smooth and collision-free motion under low-probability switching modes. Experiments on Argoverse2 and INTERACTION datasets show that CogDrive achieves strong performance in trajectory accuracy and miss rate, while closed-loop simulations confirm adaptive behavior in merge and intersection scenarios. By combining cognitive multimodal prediction with safety-oriented planning, CogDrive offers an interpretable and reliable paradigm for safe autonomy in complex traffic.

Measuring and Rating Socioeconomic Disparities among Provinces: A Case of Turkiye

Authors:Emre Akusta

Date:2025-12-02 12:17:09

Regional disparities in the economic and social structures of countries have a great impact on their development levels. In geographically, culturally and economically diverse countries like Turkiye, determining the socioeconomic status of the provinces and regional differences is an important step for planning and implementing effective policies. Therefore, this study aims to determine the socioeconomic disparities of the provinces in Turkiye. For this purpose, a socioeconomic development index covering the economic and social dimensions of 81 provinces was constructed. For the index, 16 different indicators representing economic and social factors were used. These indicators were converted into indices using the Min-Max normalization method and Principal Component Analysis. Afterwards, using these indices, the provinces were divided into groups using the K-Means clustering algorithm and the Elbow method. In the last part of the study, the results are presented in a visual format using Scatter Plots, clustering maps and QGIS mapping tools. The results of the study show that 2 of the 81 provinces in Turkiye have very high, 30 high, 25 medium and 24 low socioeconomic indices. Istanbul and Ankara have very high socioeconomic status. In general, the provinces in western Turkiye have a high socioeconomic index, while the provinces in eastern and southeastern Anatolia face serious challenges in terms of socioeconomic indicators.
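
The index-building step described in the abstract above (Min-Max normalization followed by Principal Component Analysis) can be sketched as follows. This is an assumption-laden illustration, not the study's procedure: it weights indicators by the first principal component only, finds that component by power iteration, and rescales the scores to [0, 1]. The function names and the toy data are hypothetical, and the K-Means and Elbow steps are omitted.

```python
def min_max(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def first_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # positive start keeps the sign positive for correlated indicators
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def development_index(raw_rows):
    """Composite index: min-max each indicator, project on PC1, rescale."""
    columns = [min_max(list(col)) for col in zip(*raw_rows)]
    rows = [list(r) for r in zip(*columns)]
    weights = first_component(rows)
    scores = [sum(w * x for w, x in zip(weights, r)) for r in rows]
    return min_max(scores)


# Toy data: four hypothetical provinces, two hypothetical indicators.
toy = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]]
index = development_index(toy)
```

The resulting one-dimensional scores could then be passed to a clustering routine, with the number of clusters chosen by inspecting how the within-cluster sum of squares falls as the cluster count grows.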

SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

Authors:Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye

Date:2025-12-02 10:40:46

Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.

Anderson localization in high-contrast media with random spherical inclusions

Authors:Matteo Capoferri, Matthias Täufer

Date:2025-12-02 09:38:43

We study spectral properties of partial differential operators modelling composite materials with highly contrasting constituents, comprised of soft spherical inclusions with random radii dispersed in a stiff matrix. Such operators have recently attracted significant interest from the research community, including in the context of stochastic homogenization. In particular, it has been proved that the spectrum of these operators may feature a band-gap structure in the regime where heterogeneities take place on a sufficiently small scale. However, the nature of the limiting (as the small scale tends to zero) spectrum in the above setting is non-classical and not completely understood. In this paper we prove for the first time that Anderson localization occurs near band edges, thus shedding light on the limiting spectral behaviour. Our results rely on recent nontrivial advancements in quantitative unique continuation for PDEs, in combination with assumptions on the model that are standard in the Anderson localization literature, and which we plan to relax in future works.

AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

Authors:Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti

Date:2025-12-02 09:00:12

Information gathering in large-scale or time-critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi-agent systems. These scenarios are commonly formulated as multi-agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning-based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long-horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long-term trajectories in a non-autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine-tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two-stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving up to 4x faster execution and 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at https://github.com/marmotlab/AID.

YingVideo-MV: Music-Driven Multi-Stage Video Generation

Authors:Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen

Date:2025-12-02 07:31:19

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .

Channel Knowledge Map Enabled Low-Altitude ISAC Networks: Joint Air Corridor Planning and Base Station Deployment

Authors:Jiaxuan Li, Yilong Chen, Fan Liu, Jie Xu

Date:2025-12-02 06:43:52

This letter addresses the joint air corridor planning and base station (BS) deployment problem for low-altitude integrated sensing and communication (ISAC) networks. In the considered system, unmanned aerial vehicles (UAVs) operate within a structured air corridor composed of connected cubic segments, and multiple BSs need to be selectively deployed at a set of candidate locations to ensure both sensing and communication coverage throughout the corridor. In particular, we leverage the channel knowledge map (CKM) to characterize wireless channels for candidate BS sites prior to deployment, thereby facilitating the offline planning. Under this setup, we minimize the system cost in terms of the weighted sum of the air corridor length and the number of deployed BSs, subject to the constraints on both sensing and communication performance across the corridor. To solve the formulated large-scale nonconvex integer programming problem, we develop a hierarchical coarse-to-fine grid decomposition algorithm. Simulation results demonstrate the benefit of the proposed joint design in reducing the overall deployment cost while ensuring the coverage of the low-altitude ISAC networks.

A Datalake for Data-driven Social Science Research

Authors:Puneet Arya, Ojas Sahasrabudhe, Adwaiya Srivastav, Partha Pratim Das, Maya Ramanath

Date:2025-12-02 06:40:47

Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.

nuScenes Revisited: Progress and Challenges in Autonomous Driving

Authors:Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar

Date:2025-12-02 06:14:28

Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

Authors:Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin

Date:2025-12-02 04:47:53

Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
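
To make the Tversky loss mentioned in the abstract above concrete, here is a minimal sketch over flat lists of predicted wall probabilities and binary labels. The function name, the smoothing term, and the `alpha`/`beta` values are assumptions chosen for illustration; the abstract does not state the hyperparameters used. With `alpha` greater than `beta`, false positives are penalized more heavily than false negatives, and `alpha = beta = 0.5` reduces to the Dice loss.

```python
def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky loss: 1 - TP / (TP + alpha * FP + beta * FN).

    pred   -- predicted foreground probabilities in [0, 1]
    target -- ground-truth labels (0 or 1)
    alpha  -- weight on false positives (illustrative value)
    beta   -- weight on false negatives (illustrative value)
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# One spurious wall pixel versus one missed wall pixel.
loss_false_positive = tversky_loss([1, 1, 0, 0], [1, 0, 0, 0])
loss_false_negative = tversky_loss([1, 0, 0, 0], [1, 1, 0, 0])
```

With these illustrative weights the spurious-pixel case incurs the larger loss, which is the direction needed to suppress false-positive noise along boundaries. In a training pipeline the same expression would be written with differentiable tensor operations.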

Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation

Authors:Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang

Date:2025-12-02 04:21:02

Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rates in locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2 Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Authors:Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou

Date:2025-12-02 04:12:57

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

Fleet Size and Mix Capacitated Vehicle Routing Problem with Time Windows for Mobile Fast Chargers

Authors:Farhang Motallebi Araghi, Armin Abdolmohammdi, Navid Mojahed, Shima Nazari

Date:2025-12-02 03:45:57

The electrification of off-road heavy equipment presents operational challenges for agencies serving remote sites with limited fixed charging infrastructure. Existing mobile fast charging vehicle (MFCV) planning approaches typically treat fleet design and routing as separate problems, fixing vehicle characteristics before dispatch. This paper formulates a fleet size and mix capacitated vehicle routing problem with time windows (FSMCVRPTW) for MFCV deployment, jointly optimizing fleet composition, charger specifications, routing, and scheduling within a unified mixed-integer linear program. The model incorporates heterogeneous MFCV types with varying power ratings, battery capacities, fuel range, and cost structures, minimizing total daily cost from labor, fuel, amortized capital expenditure, and energy purchase under temporal service windows, resource budgets, and energy-delivery constraints. The formulation is implemented in Python/Gurobi and applied to two case studies using California Department of Transportation wheel-loader data in Los Angeles (dense urban) and Truckee (sparse mountainous). Results show that simultaneous optimization yields compact, well-utilized fleets that meet all service windows while revealing strong sensitivity of unit cost to demand density and geography. The proposed FSMCVRPTW framework provides a generalizable decision-support methodology that co-designs fleet size, charger power, routing, and service schedules in a single optimization layer for context-aware, cost-efficient mobile fast charging.

On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning

Authors:Liyuan Lou, Wanyun Li, Wentian Gan, Yifei Yu, Tengfei Wang, Xin Wang, Zongqian Zhan

Date:2025-12-02 03:32:02

Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin maintenance. However, most existing methods focus on processing captured images or sequential frames in real time, without explicitly evaluating the quality of the on-the-go 3D reconstruction or providing guided feedback to enhance image acquisition in the target area. This work presents On-the-fly Feedback SfM, an explore-and-exploit framework for real-time UAV photogrammetry, enabling iterative exploration of unseen regions and exploitation of already observed and reconstructed areas in near real time. Built upon SfM on-the-fly, the proposed method integrates three modules: (1) online incremental coarse-mesh generation for a dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. Comprehensive experiments demonstrate that our method achieves in-situ reconstruction and evaluation in near real time while providing actionable feedback that markedly reduces coverage gaps and re-flight costs. By integrating data collection, processing, 3D reconstruction and assessment, and online feedback, On-the-fly Feedback SfM could support the transition from a traditional passive working mode to a more intelligent and adaptive exploration workflow. Code is now available at https://github.com/IRIS-LAB-whu/OntheflySfMFeedback.

Neural networks for multi-horizon stochastic programming

Authors:Hongyu Zhang, Gabriele Sormani, Enza Messina, Alan King, Francesca Maggioni

Date:2025-12-02 00:19:50

This paper proposes a machine-learning-based solution approach for solving multi-horizon stochastic programs. The approach embeds a deep learning neural network into a multi-horizon stochastic program to approximate the recourse operational objective function. The proposed approach is demonstrated on a UK power system planning problem with uncertainty at investment and operational timescales. The results show that (1) the surrogate neural network performs well across three different architectures, (2) the proposed approach is up to 34.72 times faster than the direct solution of the monolithic deterministic equivalent counterpart, (3) the surrogate-based solutions yield comparable in-sample stability and improved out-of-sample performance relative to the deterministic equivalent, indicating better generalisation to unseen scenarios. The main contributions of the paper are: (1) we propose a machine-learning-based framework for solving multi-horizon stochastic programs, (2) we introduce a neural network embedding formulation tailored to multi-horizon stochastic programs with continuous first-stage decisions and fixed scenario sets, extending existing surrogate modelling approaches from two-stage to multi-horizon settings, and (3) we provide an extensive computational study on a realistic UK power system planning problem, demonstrating the trade-off between approximation accuracy, computational efficiency, and solution robustness for different neural network architectures and scenario set sizes.

Towards Modeling Road Access Deprivation in Sub-Saharan Africa Based on a New Accessibility Metric and Road Quality

Authors:Sebastian Hafner, Qunshan Zhao, Bunmi Alugbin, Kehinde Baruwa, Caleb Cheruiyot, Sabitu Sa'adu Da'u, Xingyi Du, Peter Elias, Helen Elsey, Ryan Engstrom, Serkan Girgin, Diego F. P. Grajales, Esther Judith, Caroline Kabaria, Monika Kuffer, Oluwatoyin Odulana, Francis C. Onyambu, Adenike Shonowo, Dana R. Thomson, Mingyu Zhu, João Porto de Albuquerque

Date:2025-12-01 20:33:13

Access to motorable roads is a critical dimension of urban infrastructure, particularly in rapidly urbanizing regions such as Sub-Saharan Africa. Yet, many urban communities, especially those in informal settlements, remain disconnected from road networks. This study presents a road access deprivation model that combines a new accessibility metric, capturing how well buildings are connected to the road network, with road surface type data as a proxy for road quality. These two components together enable the classification of urban areas into low, medium, or high deprivation levels. The model was applied to Nairobi (Kenya), Lagos (Nigeria), and Kano (Nigeria) using open geospatial datasets. Across all three cities, the majority of built-up areas fall into the low and medium road access deprivation levels, while highly deprived areas are comparatively limited. However, the share of highly deprived areas varies substantially, ranging from only 11.8% in Nairobi to 27.7% in Kano. Model evaluation against community-sourced validation data indicates good performance for identifying low deprivation areas (F1 > 0.74), moderate accuracy for medium deprivation in Nairobi and Lagos (F1 > 0.52, lower in Kano), and more variable results for high deprivation (F1 ranging from 0.26 in Kano to 0.69 in Nairobi). Furthermore, analysis of grid cells with multiple validations showed strong agreement among community members, with disagreements occurring mainly between adjacent deprivation levels. Finally, we discuss two sources of disagreement with community validations: (1) misalignment between the conceptual model and community perceptions, and (2) the operationalization of the conceptual model. In summary, our road access deprivation modeling approach demonstrates promise as a scalable, interpretable tool for identifying disconnected areas and informing urban planning in data-scarce contexts.
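
The per-class F1 scores quoted in the abstract can be read as one-vs-rest scores for each deprivation level. The following is a minimal sketch of that computation, assuming a one-vs-rest definition; the function name and the toy labels are illustrative only and are not taken from the paper or its data.

```python
# Toy labels, invented for illustration; NOT data from the study.
truth = ["low", "low", "medium", "high", "medium", "low"]
pred = ["low", "medium", "medium", "high", "low", "low"]


def f1_per_class(truth, pred, label):
    """One-vs-rest F1 score for a single class label."""
    pairs = list(zip(truth, pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly flagged
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly flagged
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


for level in ("low", "medium", "high"):
    print(level, round(f1_per_class(truth, pred, level), 3))
```

Reporting the score separately for each level, rather than a single accuracy figure, makes it visible when a model handles the common classes well but struggles with a rarer one.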

Scalable Distributed Nonlinear Control Under Flatness-Preserving Coupling

Authors:Fengjun Yang, Jake Welde, Nikolai Matni

Date:2025-12-01 19:06:13

We study distributed control for a network of nonlinear, differentially flat subsystems subject to dynamic coupling. Although differential flatness simplifies planning and control for isolated subsystems, the presence of coupling can destroy this property for the overall joint system. Focusing on subsystems in pure-feedback form, we identify a class of compatible lower-triangular dynamic couplings that preserve flatness and guarantee that the flat outputs of the subsystems remain the flat outputs of the coupled system. Further, we show that the joint flatness diffeomorphism can be constructed from those of the individual subsystems and, crucially, its sparsity structure reflects that of the coupling. Exploiting this structure, we synthesize a distributed tracking controller that computes control actions from local information only, thereby ensuring scalability. We validate our proposed framework on a simulated example of planar quadrotors dynamically coupled via aerodynamic downwash, and show that the distributed controller achieves accurate trajectory tracking.

ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

Authors:Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Zhuoyang Liu, Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

Date:2025-12-01 18:59:50

Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.

GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

Authors:Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, Sebastian Scherer

Date:2025-12-01 18:03:29

Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
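
GRPO scores each sampled rollout relative to the other rollouts in its group rather than against a learned value function. The sketch below illustrates that group-relative normalization for a scalar reward built from several terms. The equal weighting of the three reward terms, the function names, and the use of the population standard deviation are assumptions made for illustration; the abstract does not specify how GrndCtrl combines its rewards.

```python
from statistics import mean, pstdev


def combined_reward(pose, depth, temporal, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward terms (weights are hypothetical)."""
    w_pose, w_depth, w_temporal = weights
    return w_pose * pose + w_depth * depth + w_temporal * temporal


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style).

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts generated from the same context.
group = [
    combined_reward(0.2, 0.5, 0.3),
    combined_reward(0.6, 0.7, 0.7),
    combined_reward(0.9, 1.0, 1.1),
]
print(group_relative_advantages(group))
```

Rollouts that score above the group mean receive positive advantages and are reinforced; those below it are discouraged, without any separate critic network.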

Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models

Authors:Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid

Date:2025-12-01 17:57:27

Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and on real robots, demonstrating the impact of our generated failure data. Code, data, and models are available at https://www.di.ens.fr/willow/research/guardian/.

Prejudiced Futures? Algorithmic Bias in Time Series Forecasting and Its Ethical Implications

Authors:Bagattini Alexander, Chen Shao

Date:2025-12-01 16:59:15

Time series prediction algorithms are increasingly central to decision-making in high-stakes domains such as healthcare, energy management, and economic planning. Yet, these systems often inherit and amplify biases embedded in historical data, flawed problem specifications, and socio-technical design decisions. This paper critically examines the ethical foundations and mitigation strategies for algorithmic bias in time series prediction. We outline how predictive models, particularly in temporally dynamic domains, can reproduce structural inequalities and emergent discrimination through proxy variables and feedback loops. The paper advances a threefold contribution: First, it reframes algorithmic bias as a socio-technical phenomenon rooted in normative choices and institutional constraints. Second, it offers a structured diagnosis of bias sources across the pipeline, emphasizing the need for causal modeling, interpretable systems, and inclusive design practices. Third, it advocates for structural reforms that embed fairness through participatory governance, stakeholder engagement, and legally enforceable safeguards. Special attention is given to fairness validation in dynamic environments, proposing multi-metric, temporally aware, and context-sensitive evaluation methods. Ultimately, we call for an integrated ethics-by-design approach that positions fairness not as a trade-off against performance, but as a co-requisite of responsible innovation. This framework is essential to developing predictive systems that are not only effective and adaptive but also aligned with democratic values and social equity.

The Hidden Cost of Straight Lines: Quantifying Misallocation Risk in Voronoi-based Service Area Models

Authors:JA Torrecilla Pinero, JM Ceballos Martínez, A Cuartero Sáez, P Plaza Caballero, A Cruces López

Date:2025-12-01 15:30:58

Voronoi tessellations are standard in spatial planning for assigning service areas based on Euclidean proximity, underpinning regulatory frameworks like the proximity principle in waste management. However, in regions with complex topography, Euclidean distance poorly approximates functional accessibility, causing misallocations that undermine efficiency and equity. This paper develops a probabilistic framework to quantify misallocation risk by modeling travel distances as random scalings of Euclidean distances and deriving the probability of incorrect assignment as a function of local Voronoi geometry. Using plant-municipality observations (n=383) in Extremadura, Spain (41,635 km²), we demonstrate that the Log-Normal distribution provides the best relative fit among alternatives (K-S statistic=0.110). Validation reveals 15.4% of municipalities are misallocated, consistent with the theoretical prediction interval (52-65 municipalities at 95% confidence). Our framework achieves 95% agreement with complex spatial models at O(n) complexity. Poor absolute fit of global distributions (p-values<0.01) reflects diverse topography (elevation 200-2,400 m), motivating spatial stratification. Sensitivity analysis validates the fitted dispersion parameter (s=0.093) for predicting observed misallocation. We provide a calibration protocol requiring only 30-100 pilot samples per zone, enabling rapid risk assessment without full network analysis. This establishes the first probabilistic framework for Voronoi misallocation risk with practical guidelines emphasizing spatial heterogeneity and context-dependent calibration.
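
To make the idea of a misallocation probability concrete, here is a minimal sketch under simplifying assumptions that are not stated in the abstract: only the two nearest plants compete, and each travel distance equals the Euclidean distance times an independent Log-Normal factor with the same log-scale dispersion s. Under those assumptions the difference of the two log-factors is normal with variance 2s², so the probability that the Euclidean-nearest plant is not the travel-nearest one is 1 - Φ(ln(d2/d1) / (s√2)). The function name is hypothetical, and the paper's own derivation may differ.

```python
import math


def misallocation_probability(d1, d2, s):
    """P(travel distance to nearest plant exceeds that to the runner-up).

    d1: Euclidean distance to the nearest plant (d1 <= d2)
    d2: Euclidean distance to the second-nearest plant
    s:  log-scale dispersion of the Log-Normal scaling factor
    Assumes independent, identically distributed scaling factors.
    """
    if not (0 < d1 <= d2) or s <= 0:
        raise ValueError("require 0 < d1 <= d2 and s > 0")
    z = math.log(d2 / d1) / (s * math.sqrt(2.0))
    # Standard normal upper tail: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Dispersion value reported in the abstract; distance ratios are illustrative.
S_FITTED = 0.093
for ratio in (1.0, 1.05, 1.1, 1.5):
    print(ratio, misallocation_probability(1.0, ratio, S_FITTED))
```

In this simplified model the risk depends only on the ratio d2/d1, so it concentrates near Voronoi cell boundaries, where the two distances are nearly equal, and vanishes quickly toward cell interiors.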

IGen: Scalable Data Generation for Robot Learning from Open-World Images

Authors:Chenghao Gu, Haolan Kang, Junchao Lin, Jinghe Wang, Duo Wu, Shuzhao Xie, Fanding Huang, Junchen Ge, Ziyang Gong, Letian Li, Hongying Zheng, Changwei Lv, Zhi Wang

Date:2025-12-01 15:15:04

The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.

Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels

Authors:Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Malik Galijasevic, Elke Ruth Gizewski, Astrid Ellen Grams

Date:2025-12-01 15:13:33

Medical image registration is crucial for various clinical and research applications, including disease diagnosis and treatment planning, which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
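
The initialization described above (a predefined edge kernel perturbed with random noise) can be sketched as follows. The choice of a horizontal Sobel kernel, the Gaussian noise, the noise scale, and all function names are assumptions for illustration; the abstract does not say which edge operator or noise distribution the authors use. In a real model the resulting values would seed a trainable convolution layer; here they are only applied once to a toy image.

```python
import random

# Horizontal-gradient Sobel kernel, used here as the assumed starting point.
SOBEL_X = [[-1.0, 0.0, 1.0],
           [-2.0, 0.0, 2.0],
           [-1.0, 0.0, 1.0]]


def init_edge_kernel(noise_std=0.1, seed=0):
    """Return the Sobel kernel with Gaussian noise added to each weight."""
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise_std) for w in row] for row in SOBEL_X]


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Toy 4x4 image with a vertical step edge between columns 1 and 2.
step_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
kernel = init_edge_kernel(noise_std=0.1, seed=0)
print(correlate2d_valid(step_image, kernel))
```

Starting from an edge detector gives the layer a useful structural prior on the first training step, while the noise breaks symmetry between kernels so that gradient updates can specialize them differently.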

Forced Migration and Information-Seeking Behavior on Wikipedia: Insights from the Ukrainian Refugee Crisis

Authors:Carolina Coimbra Vieira, Ebru Sanliturk, Emilio Zagheni

Date:2025-12-01 13:58:29

Gathering information about where to migrate is an important part of the migration process, especially during forced migration, when people must make rapid decisions under uncertainty. This study examines how forced migration relates to online information-seeking on Wikipedia. Focusing on the 2022 Russian invasion of Ukraine, we analyze how the resulting refugee crisis, which led to over six million Ukrainians fleeing across Europe, shaped views of Wikipedia articles about European cities. We compare changes in views of Ukrainian-language Wikipedia articles, used as a proxy for information-seeking by Ukrainians, with those in four other language editions. Our findings show that views of Ukrainian-language articles about European cities correlate more strongly with the number of Ukrainian refugees applying for temporary protection in European countries than views in other languages. Because Poland and Germany became the main destinations for refugees, we examine these countries more closely and find that applications for temporary protection in Polish and German cities are also more strongly correlated with views of their Ukrainian-language Wikipedia articles. We further analyze the timing between refugee flows to Poland and online information-seeking. Refugee border crossings occurred before increases in Ukrainian-language views of Polish city articles, indicating that information-seeking surged after displacement. This reactive pattern contrasts with the pre-departure planning typical of regular labor migration. Moreover, while official protection applications often lagged behind border crossings by weeks, Wikipedia activity rose almost immediately. Overall, Wikipedia usage offers a near real-time indicator of emerging migration patterns during crises.
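
One simple way to examine timing between two daily series, such as border crossings and article views, is a lagged Pearson correlation: shift one series by k days and see which shift correlates best. The sketch below illustrates that idea only; the abstract does not state which method the authors used, the function names are hypothetical, and the numbers are synthetic, not the study's data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_lag(leader, follower, max_lag):
    """Find the lag k >= 0 maximizing corr(leader[t], follower[t + k])."""
    best_k, best_r = 0, -2.0
    for k in range(max_lag + 1):
        r = pearson(leader[:len(leader) - k], follower[k:])
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r


# Synthetic series: "views" repeat "crossings" two days later.
crossings = [0, 1, 4, 2, 7, 3, 8, 5, 1, 6, 2, 9]
views = [0, 0] + crossings[:-2]
print(best_lag(crossings, views, max_lag=4))
```

A positive best lag means the second series trails the first, which is the pattern the abstract describes for information-seeking following displacement.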