planning - 2026-05-15

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Authors:Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee
Date:2026-05-14 17:58:19

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning

Authors:Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang
Date:2026-05-14 17:32:18

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
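
The ranking step described above (scoring candidates on several sub-scores, then picking top-k and Pareto-optimal targets) can be illustrated with a small sketch. This is not CLOVER's implementation: the candidate names, the three sub-score axes, and the plain-mean aggregate are all hypothetical stand-ins chosen for illustration, and the real planning metrics aggregate sub-scores differently.

```python
from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]

def top_k(scores: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Top-k candidates by a hypothetical aggregate (plain mean of sub-scores)."""
    ranked = sorted(scores, key=lambda n: sum(scores[n]) / len(scores[n]), reverse=True)
    return ranked[:k]

# Hypothetical sub-scores per candidate: (safety, progress, comfort), higher is better.
candidates = {
    "keep_lane": (0.95, 0.60, 0.90),
    "overtake": (0.80, 0.95, 0.60),
    "hard_brake": (0.90, 0.20, 0.30),
    "slow_follow": (0.95, 0.50, 0.85),
}

if __name__ == "__main__":
    print("Pareto front:", pareto_front(candidates))
    print("Top-2 by mean:", top_k(candidates, 2))
```

A vector-valued Pareto filter keeps candidates that trade one sub-score against another, whereas a scalar top-k commits to a single weighting; using both gives refinement targets that are robust to the choice of aggregate.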

From Data to Action: Accelerating Refinery Optimization with AI

Authors:Dániel Pfeifer, Ábrahám Papp, Tibor Bernáth, Tamás Zoltán Varga, Márk Czifra, Botond Szilágyi, Edith Alice Kovács
Date:2026-05-14 17:07:41

Refinery optimization today relies on vast amounts of data, which can be handled with modern Linear Programming (LP) software, but interpreting and applying the results remains challenging. Large petrochemical companies use massive models, with hundreds of thousands of input matrix elements. The LP solution is mathematically correct, but simplifications are made in the model, and data supply errors may occur. Therefore, further insight is needed to trust the results. The LP solver does not have a memory, so additional understanding could be gained by analyzing historical data and comparing it to the current plan. As such, machine learning approaches were suggested to support decision making based on the LP solution. Among these, Anomaly Detection tools are proposed to be used in tandem with the LP output. A transformed version of the popular ECOD methodology is applied. A new method is proposed to handle high-dimensional data by choosing the most informative pairs. This selection is then used alongside two 2D Anomaly Detection algorithms, revealing several business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.

SOCC-ICP: Semantics-Assisted Odometry based on Occupancy Grids and ICP

Authors:Johannes Scherer, Sebastian Hirt, Henri Meeß
Date:2026-05-14 17:00:38

Reliable pose estimation in previously unseen environments is a fundamental capability of autonomous systems. Existing LiDAR odometry methods typically employ point-, surfel-, or NDT-based map representations, which are distinct from the semantic occupancy grids commonly used for downstream tasks such as motion planning. We introduce SOCC-ICP, a semantics-assisted odometry framework that jointly performs Semantic OCCupancy grid mapping and LiDAR scan alignment. Each map voxel encodes geometric and semantic statistics, enabling adaptive point-to-point or point-to-plane ICP based on local planarity. Further, the occupancy grid naturally filters dynamic objects through raycasting-based free-space updates. Across diverse evaluation scenarios, SOCC-ICP achieves performance competitive with state-of-the-art LiDAR odometry and remains robust in geometrically degenerate environments, even in the absence of semantic cues. When semantic labels are available, integrating them into map construction, downsampling, and correspondence weighting yields further accuracy gains. By unifying odometry and semantic occupancy grid mapping within a single representation, SOCC-ICP eliminates redundant map structures and directly provides a map suitable for downstream robotic applications.

Sophie Germain, mathématicienne extraordinaire: A story stranger than fiction

Authors:David Pengelley
Date:2026-05-14 16:39:14

Sophie Germain (1776-1831) is the first woman known to have done important original research in mathematics, specifically in elasticity theory and number theory. Celebrating her semiquincentennial year, we outline Germain's recently unearthed number theory results on Fermat's Last Theorem, in the context of her life, work, and interactions with Lagrange, Legendre, and Gauss. For two centuries her accomplishment on Fermat's Last Theorem was thought to consist of a single theorem attributed to her in a publication by Legendre, the first general result towards proving Fermat's Last Theorem. But recent discoveries in her handwritten manuscripts and correspondence with Legendre and Gauss show that she accomplished much more, although this work was forgotten. In particular, she had a grand plan for proving Fermat's Last Theorem in its entirety, and carried this plan a long way, using then-new tools, e.g., congruence, modular primitive roots, and permutations.

Multi-Agentic Approach for History Matching of Oil Reservoirs

Authors:Linar Samigullin, Sergei Shumilin, Evgeny Burnaev
Date:2026-05-14 16:25:51

History matching is a central inverse problem in reservoir engineering, where uncertain reservoir parameters must be calibrated against observations. Although automated history matching can reduce manual effort, practical deployment remains difficult because engineers must still configure heterogeneous workflows involving parameter selection, physically admissible bounds, optimizer choice, hyperparameter tuning, simulator execution, and diagnostic reporting. We propose PetroGraph, a multi-agent framework for intelligent reservoir history matching that decomposes this workflow into specialized agents for model review, experimental planning, parameterization, optimization, simulation, and summarization. The system combines large language model agents with domain-specific tools, retrieval-augmented access to simulator documentation, validation of modified ECLIPSE input decks, human-in-the-loop checkpoints, and an OPM Flow-based simulation backend. This design enables users to initiate and steer history matching through natural language while preserving explicit control over selected parameters and optimization settings. We evaluate PetroGraph on three reservoir models of increasing complexity: the synthetic SPE1 model, the faulted SPE9 benchmark, and the real-field Norne model. Using weighted normalized root mean square error as the objective, PetroGraph reduces the mismatch by 95% on SPE1, 69% on SPE9, and 13% on Norne. These results demonstrate that multi-agent orchestration can automate key decisions in history matching, lower the expertise barrier for operating complex simulation workflows, and provide a flexible foundation for extensible, domain-aware reservoir model adaptation.
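
To make the objective and the reported reductions concrete, here is a minimal sketch of a weighted normalized RMSE and the percentage reduction in mismatch. The abstract does not give the exact formula, so the range normalization, the weighted average across series, and the series names used below are assumptions for illustration only.

```python
import math
from typing import Dict, Sequence

def nrmse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """RMSE divided by the observed range (one common normalization; assumed here)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    span = max(observed) - min(observed)
    if span == 0:
        raise ValueError("observed series has zero range")
    mse = sum((o - s) ** 2 for o, s in zip(observed, simulated)) / len(observed)
    return math.sqrt(mse) / span

def weighted_nrmse(
    observed: Dict[str, Sequence[float]],
    simulated: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-series NRMSE values."""
    total = sum(weights.values())
    return sum(w * nrmse(observed[k], simulated[k]) for k, w in weights.items()) / total

def mismatch_reduction(before: float, after: float) -> float:
    """Percentage reduction of the objective relative to the initial model."""
    return 100.0 * (before - after) / before

if __name__ == "__main__":
    # Hypothetical well data: oil rate and bottom-hole pressure histories.
    obs = {"oil_rate": [100.0, 200.0, 300.0], "bhp": [250.0, 240.0, 230.0]}
    initial = {"oil_rate": [150.0, 260.0, 380.0], "bhp": [255.0, 248.0, 241.0]}
    matched = {"oil_rate": [102.0, 199.0, 303.0], "bhp": [250.5, 240.0, 229.0]}
    w = {"oil_rate": 2.0, "bhp": 1.0}
    before = weighted_nrmse(obs, initial, w)
    after = weighted_nrmse(obs, matched, w)
    print(f"mismatch reduced by {mismatch_reduction(before, after):.1f}%")
```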

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Authors:Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
Date:2026-05-14 16:17:35

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

Authors:Drewry H. Morris, Luis Valles, Reza Hosseini Ghomi
Date:2026-05-14 15:33:05

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
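
The compounding-error figure quoted above follows directly from the independence assumption: the end-to-end success probability is the per-step reliability raised to the number of steps. A minimal sketch of that arithmetic (the function names are illustrative, not part of GraphFlow):

```python
def chain_success_probability(per_step_reliability: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps

def required_per_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to hit a target end-to-end success rate."""
    if not 0.0 < target <= 1.0 or steps <= 0:
        raise ValueError("target must be in (0, 1] and steps must be positive")
    return target ** (1.0 / steps)

if __name__ == "__main__":
    # Ten steps at 90% each: 0.9 ** 10, roughly 35%.
    print(f"{chain_success_probability(0.9, 10):.4f}")  # prints 0.3487
```

Inverting the same formula with `required_per_step_reliability` shows how demanding long workflows are: each step must be far more reliable than the end-to-end target.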

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Authors:Jonathan Spieler, Angel Villar-Corrales, Sven Behnke
Date:2026-05-14 15:12:15

Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
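
The idea of gradient-based MPC through a differentiable model can be sketched on a toy problem. This is not Slot-MPC: the one-dimensional additive dynamics below stand in for the learned slot-based dynamics model, the gradient is written out by hand rather than obtained by automatic differentiation, and every name and hyperparameter is a hypothetical choice for illustration.

```python
from typing import List

def rollout(x0: float, actions: List[float]) -> float:
    """Toy differentiable dynamics: x_{t+1} = x_t + a_t. Returns the final state."""
    x = x0
    for a in actions:
        x = x + a
    return x

def plan(x0: float, goal: float, horizon: int = 5, iters: int = 200,
         lr: float = 0.05, effort_weight: float = 0.1) -> List[float]:
    """Optimize an action sequence by gradient descent on
    cost = (x_H - goal)^2 + effort_weight * sum(a_t^2)."""
    actions = [0.0] * horizon
    for _ in range(iters):
        terminal_error = rollout(x0, actions) - goal
        # Analytic gradient for the toy dynamics: d x_H / d a_t = 1 for every t.
        grads = [2.0 * terminal_error + 2.0 * effort_weight * a for a in actions]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions

def run_mpc(x0: float, goal: float, steps: int = 20) -> float:
    """Receding horizon: replan at every step, execute only the first action."""
    x = x0
    for _ in range(steps):
        x = x + plan(x, goal)[0]
    return x

if __name__ == "__main__":
    print(f"final state: {run_mpc(0.0, 1.0):.3f}")
```

A sampling-based planner would instead evaluate many random action sequences per step; following the gradient needs only one rollout per update, which is the efficiency argument made in the abstract.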

Relaxation of Projected Prior with Continuous Gap Shrinkage

Authors:Leo L Duan, Sunghyun Cho, Mingzhang Yin
Date:2026-05-14 15:11:06

Projected priors were originally introduced to accommodate parameter constraints, but have recently regained popularity due to their ability to assign probability mass to low-dimensional parameter sets, such as the spaces of sparse vectors, directed acyclic graphs, or transport plans. When employed as a transformation of random variables, projection is especially useful, since its contraction property not only preserves probability concentration, but also often preserves differentiability for gradient-based posterior computation. On the other hand, unless the projection can be obtained by some non-iterative algorithm, posterior computation can be expensive because it requires nesting an iterative optimization routine within each Markov chain Monte Carlo iteration. In this article, inspired by the success of continuous shrinkage models as replacements for discrete spike-and-slab priors, we propose a continuous relaxation of projected priors. The key idea is to quantify the duality gap between the primal projection loss and the dual objective, and impose a probabilistic prior that shrinks this gap toward zero. The resulting gap-shrinkage prior has a tractable form, does not require running an optimization subroutine inside each posterior update, and puts probability mass near the exact projection. We demonstrate useful properties of gap-shrinkage priors, including connections to global-local shrinkage priors, broad applicability to generalized projection functions, and competitive performance in posterior contraction. We apply the gap-shrinkage model to a marketing data analysis aimed at identifying important predictor effects on multivariate grocery-shopping decisions.

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Authors:Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen
Date:2026-05-14 14:58:46

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. In addition, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Authors:Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du
Date:2026-05-14 14:22:25

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Authors:Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen
Date:2026-05-14 14:04:46

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction: each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

Emotion-Attended Stateful Memory (EASM): The Architecture for Hyper-Personalization at Scale

Authors:Vineet Kotecha, Vansh Gupta
Date:2026-05-14 13:39:19

Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary.

Learning Direct Control Policies with Flow Matching for Autonomous Driving

Authors:Marcello Ceresini, Federico Pirazzoli, Andrea Bertogalli, Lorenzo Cipelli, Filippo D'Addeo, Anthony Dell'Eva, Alessandro Paolo Capasso, Alberto Broggi
Date:2026-05-14 13:37:48

We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real city streets, intersections, and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.
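
Sampling from a flow-matching model amounts to integrating a velocity field from noise at t = 0 to a sample at t = 1, and the few-step claim above refers to using a coarse ODE solver for that integration. The sketch below uses fixed-step Euler integration with a hand-written toy field that transports any starting point along a straight line to a single fixed target; in the paper the field is a learned network conditioned on the BEV raster, so the field, the target values, and all names here are assumptions for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]

def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_toy_field(target: Vector) -> Callable[[Vector, float], Vector]:
    """Straight-line field toward one target: v(x, t) = (target - x) / (1 - t)."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical control vector: [acceleration, curvature, acceleration, curvature].
    target = [0.5, 0.02, 0.3, 0.01]
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    sample = euler_sample(make_toy_field(target), noise, steps=4)
    print([round(s, 6) for s in sample])
```

Because the toy trajectories are straight lines, Euler integration lands on the target for any step count; a learned field is only approximately straight, which is why the number of integration steps becomes a latency versus accuracy trade-off.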

Addressing Terminal Constraints in Data-Driven Demand Response Scheduling

Authors:Maximilian Bloor, Martha White, Ehecatl Antonio del Rio Chanona, Calvin Tsay
Date:2026-05-14 12:07:53

Electrified chemical processes are incentivized by exposure to time-varying electricity markets to operate flexibly, but participating in demand response schemes can require satisfying terminal constraints over long horizons. Specifically, terminal constraints may be required when computing optimal schedules in order to preserve dynamic stability. Model-based optimization methods are computationally costly, and data-driven scheduling via reinforcement learning (RL) faces severe credit-assignment challenges. We integrate Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), using learned temporally abstract models over discrete subgoals to propagate value across extended horizons. Using a simulated air separation benchmark, we demonstrate the proposed approach improves sample efficiency over standard DDPG while satisfying terminal storage constraints, mitigating myopic control behavior.

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Authors:Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu
Date:2026-05-14 11:27:46

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in the anything-to-image (X2I) task: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.

SmartWalkCoach: An AI Companion for End-to-End Walking Guidance, Motivation, and Reflection

Authors:Xianzhe Zhang, Mingxuan Hu, Bufan Xue, Erick Purwanto, Thomas J Selig, Daniel Yonto
Date:2026-05-14 09:40:35

We present SmartWalkCoach, a mobile AI companion that supports the full walking journey: from pre-walk planning to in-walk guidance through to post-walk reflection. Addressing a gap between map navigation and motivational coaching, SmartWalkCoach orchestrates three lightweight agents: (1) GeographyAgent for conversational route curation from nearby points of interest and user preferences while delegating pathfinding to map APIs; (2) AccompanyAgent for context-aware, just-in-time prompts that blend informational cues with relational encouragement; and (3) SummaryAgent for concise reflection and next-step planning. This end-to-end, tool-using design aims to lower cognitive load in planning and sustain engagement and motivation during walking by delivering dynamic, cadence-aware interventions. We conducted an in-the-wild, two-period AB/BA crossover study (N=12), where each participant completed two comparable walks with counterbalanced conditions: Information-only versus Information+Motivation. Linear mixed models show that adding motivational, companion-like dialogue significantly improved outcomes: participants reported higher positive feelings and better user experience, with no evidence of carryover. Thematic analysis surfaced two design imperatives for mobile companions: supportive, relational expression and context-aware timing (e.g., avoiding high-load moments, intervening at fatigue/milestones). Our contributions are: (i) an end-to-end, tool-using agent architecture for everyday walking that reduces cognitive load during planning and accompaniment; (ii) a controlled field evaluation linking context-aware motivation to affect and UX gains; and (iii) actionable design guidance on expression, timing, and frequency for mHealth companions. We outline limitations and paths toward multimodal, voice-first companions with adaptive personalization mechanisms.

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Authors:Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu
Date:2026-05-14 07:47:53

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Authors:Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na, Xiao Yang, Zhiwen Lin, Weipeng Zhang, Lang Sun, Zhiheng Xue, Haoran Liu, Weijie Zhang, Bo Yang
Date:2026-05-14 07:15:46

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista
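
One way to picture the evidence state for cross-region de-duplication is a list of boxes in normalized image coordinates, where a new detection is kept only if it does not overlap an existing entry. The sketch below is an assumption for illustration, not GeoVista's implementation: the `iou` and `merge_evidence` helpers, the box format `(x1, y1, x2, y2)` in `[0, 1]`, and the 0.5 threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_evidence(detections, threshold=0.5):
    """Keep one box per object; drop boxes that overlap an earlier one."""
    kept = []
    for box in detections:
        if all(iou(box, seen) < threshold for seen in kept):
            kept.append(box)
    return kept


# The same object reported from two inspection branches, plus a distinct one.
found = [(0.10, 0.10, 0.20, 0.20), (0.10, 0.10, 0.20, 0.20), (0.50, 0.50, 0.60, 0.60)]
unique = merge_evidence(found)
```

Because the coordinates are normalized to the full image, detections from crops at different zoom levels can be compared directly once mapped back to the global frame.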

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Authors:Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li, Feilong Tang, Wei Feng, Zongyuan Ge
Date:2026-05-14 05:41:11

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.
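
A deterministic audit with confidence, coverage, and conflict gates might look like the following sketch. The abstract does not specify how the gates are defined, so everything here is an assumption: the `critic_audit` function, its thresholds, and the rule that any retrieved label disagreeing with the prediction counts as a conflict are hypothetical.

```python
def critic_audit(confidence, evidence_labels, predicted_label,
                 min_confidence=0.6, min_evidence=3):
    """Return the names of the gates that fail (empty list means pass).

    All thresholds are illustrative assumptions, not values from the paper.
    """
    failures = []
    if confidence < min_confidence:
        failures.append("confidence")   # prediction is too uncertain
    if len(evidence_labels) < min_evidence:
        failures.append("coverage")     # too little retrieved evidence
    if any(label != predicted_label for label in evidence_labels):
        failures.append("conflict")     # sources disagree with the prediction
    return failures


# Retrieved cases disagree with the prediction, so only the conflict gate fires.
result = critic_audit(0.9, ["melanoma", "melanoma", "nevus"], "melanoma")
```

A non-empty result would be the trigger for a targeted self-correction step in a plan-execute-reflect loop.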

Coding Agent Is Good As World Simulator

Authors:Hongyu Wang, Jingquan Wang, Bocheng Zou, Radu Serban, Dan Negrut
Date:2026-05-14 05:33:41

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are often not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework that constructs physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, the visual review agent provides visual feedback, and the physics analysis agent checks physical consistency. The code is iteratively revised based on this feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity, and visual quality, and that it could be applied to various scenarios including driving simulation and embodied robot tasks.

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Authors:Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit, Z. Berkay Celik, Ryan Gerdes, Ming F. Li
Date:2026-05-14 05:21:17

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variations (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and keep the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, where MIRAGE is the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism: MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), while AdvPatch passes only 0--9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

Authors:Vidya K Sudarshan, Anushka Sisodia, Reshma A Ramachandra, Sia Batra, Josephine Chong Leng Leng
Date:2026-05-14 02:11:07

Integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of an integrated, ecosystem-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on an agentic multi-agent AI platform in higher education, consisting of interconnected, autonomous, goal-driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can such agents collectively support seamless, coordinated operations across teaching, learning and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of a holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for a scalable, human-aligned, and inclusive agentic AI platform. Its main contribution is the incorporation of inclusive learning perspectives, highlighting how a coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

Authors:Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang
Date:2026-05-14 01:23:09

Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
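
The decision-time gate described above can be sketched as a per-action blend of two ensemble summaries. This is a minimal illustration under stated assumptions rather than the authors' formulation: it takes the ensemble maximum as the optimistic estimate, the ensemble minimum as the conservative one, and interpolates linearly with the predicted risk as the weight; `risk_gated_values` is a hypothetical name.

```python
def risk_gated_values(q_ensemble, risk):
    """Blend optimistic and conservative value estimates per action.

    q_ensemble: list of per-member value lists, one value per action.
    risk: predicted near-term violation probability per action, in [0, 1].
    The max/min summaries and the linear blend are illustrative assumptions.
    """
    gated = []
    for action, r in enumerate(risk):
        member_values = [member[action] for member in q_ensemble]
        q_optimistic = max(member_values)
        q_conservative = min(member_values)
        r = min(1.0, max(0.0, r))  # clamp risk to a valid weight
        gated.append((1.0 - r) * q_optimistic + r * q_conservative)
    return gated


# Two ensemble members, two actions: action 0 is safe, action 1 is risky.
values = risk_gated_values([[1.0, 2.0], [0.0, 4.0]], risk=[0.0, 1.0])
best_action = max(range(len(values)), key=values.__getitem__)
```

With zero predicted risk an action keeps its optimistic value, and with risk of one it falls back to the conservative value, so a risky action needs a high worst-case estimate to be selected.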

Reactive Planning based Control for Mobile Robots in Obstacle-Cluttered Environments

Authors:Li Tan, Junlin Xiong, Yan Wang, Wei Ren
Date:2026-05-14 00:57:28

This paper addresses the motion control problem for mobile robots in obstacle-cluttered environments. The mobile robot has only partial environment information, and aims to move from an initial position to a target position without collisions. For this purpose, a reactive planning based control strategy (RPCS) is proposed. First, the initial and target positions are connected to form a reference trajectory. Then, a reactive planning strategy (RPS) is developed to ensure collision avoidance by modifying the reference trajectory locally based on the partial environment information. Next, an adaptive tracking control strategy (ATCS) is proposed to track the reference trajectory, including any local modifications, via discretization techniques. Finally, the RPS and ATCS are combined to establish the RPCS, whose efficacy and advantages are illustrated by numerical examples.

ASH: Agents that Self-Hone via Embodied Learning

Authors:Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun
Date:2026-05-14 00:10:12

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Authors:Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai
Date:2026-05-13 23:35:14

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

Motion Planning for Autonomous Vehicles using Optimization over Graphs of Convex Sets

Authors:Matheus Wagner, Antônio Augusto Fröhlich
Date:2026-05-13 23:29:54

Motion planning for autonomous vehicles requires generating collision-free and dynamically feasible trajectories in complex environments under real-time constraints. While nonlinear optimal control formulations provide high-fidelity solutions, they are computationally demanding and sensitive to initialization, whereas geometric planning methods scale well but often decouple path selection from trajectory optimization. This paper studies the extent to which optimization over Graphs of Convex Sets (GCS) can approximate solutions of nonlinear optimal control problems in the context of autonomous driving. The free space is represented as a finite union of convex regions organized as a directed graph, allowing nonconvex geometry to be handled through discrete connectivity decisions while maintaining convex trajectory constraints within each region. Vehicle motion is parameterized using Bezier curves for the spatial path and a polynomial time-scaling function for temporal evolution. Under small-slip and linear tire assumptions, a simplified dynamic bicycle model enables approximate enforcement of dynamic feasibility through convex constraints on trajectory derivatives. The approach is evaluated in CommonRoad scenarios involving static obstacle avoidance and lane-changing maneuvers, and is compared against a nonlinear discrete-time optimal control formulation. The results indicate that the GCS-based method generates collision-free and dynamically consistent trajectories that closely match those obtained from the nonlinear program, while exhibiting improved computational efficiency and reduced sensitivity to initialization. These findings suggest that GCS provides a structured approximation of nonlinear motion planning problems, capturing dominant geometric and dynamic effects while preserving convexity in the continuous relaxation.
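
A Bezier parameterization suits convex-region constraints because the curve always lies inside the convex hull of its control points, so keeping the control points in a convex region keeps the whole path there. The sketch below evaluates a curve with de Casteljau's algorithm; the `bezier_point` helper and the control points are hypothetical and are not taken from the paper.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] (de Casteljau)."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        # Replace each adjacent pair by its linear interpolation at t.
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


# Hypothetical quadratic path whose control points lie in the box [0, 2] x [0, 2].
control = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
midpoint = bezier_point(control, 0.5)
samples = [bezier_point(control, i / 10) for i in range(11)]
inside_box = all(0.0 <= x <= 2.0 and 0.0 <= y <= 2.0 for x, y in samples)
```

Each interpolation step is a convex combination, which is why every sampled point stays within the box spanned by the control points.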

TSAgent: An Agentic Workflow for Autonomous Transition State Search

Authors:Varun Madhavan, Ankit Mathanker, Dean M. Sweeney, Oluwatosin A. Ohiro, Yixin Wang, Bryan R. Goldsmith
Date:2026-05-13 22:08:24

Identifying transition states (TSs) on potential energy surfaces is a central computational bottleneck in mechanistic studies of catalytic materials. A TS search is not a single calculation but a long-horizon, multi-step workflow of atomistic simulations with delayed, asynchronous feedback and heterogeneous failure modes that require a joint multimodal analysis of scalar convergence diagnostics and atomic geometries along the reaction path. To address this challenge, we propose TSAgent, an agentic workflow that automates TS search directly at the density functional theory (DFT) level of quantum chemical accuracy. TSAgent operates through a persistent plan-execute-analyze-replan loop, continuously adapting its strategy based on convergence diagnostics and geometric feedback without human intervention. We evaluate TSAgent on a diverse 100-example subset of the OC20NEB heterogeneous catalysis benchmark, where it successfully locates TSs with 83% accuracy. In a direct comparison against expert DFT practitioners on 10 held-out examples, TSAgent achieves a 70% success rate compared to a human-expert average of 73 +/- 12%. Finally, TSAgent independently reproduces Bronsted-Evans-Polanyi scaling relationships for NH3 dissociation on metal and single-atom alloy surfaces from a published heterogeneous catalysis study, demonstrating that its utility extends beyond curated benchmarks to real scientific investigations.