Accurately registering breast MR images from different time points enables the alignment of anatomical structures and tracking of tumor progression, supporting more effective breast cancer detection, diagnosis, and treatment planning. However, the complexity of dense tissue and its highly non-rigid nature pose challenges for conventional registration methods, which primarily focus on aligning general structures while overlooking intricate internal details. To address this, we propose \textbf{GuidedMorph}, a novel two-stage registration framework designed to better align dense tissue. In addition to a single-scale network for global structure alignment, we introduce a framework that utilizes dense tissue information to track breast movement. The learned transformation fields are fused by introducing the Dual Spatial Transformer Network (DSTN), improving overall alignment accuracy. A novel warping method based on the Euclidean distance transform (EDT) is also proposed to accurately warp the registered dense tissue and breast masks, preserving fine structural details during deformation. The framework supports both paradigms that rely on external segmentation models and those that operate on image data alone. It also operates effectively with both the VoxelMorph and TransMorph backbones, offering a versatile solution for breast registration. We validate our method on the ISPY2 and an internal dataset, demonstrating superior performance in dense tissue alignment, overall breast alignment, and breast structural similarity index measure (SSIM), with improvements of over 13.01% in dense tissue Dice, 3.13% in breast Dice, and 1.21% in breast SSIM compared to the best learning-based baseline.
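To illustrate the EDT-based warping idea, here is a minimal sketch under assumed array shapes (not the authors' implementation): rather than warping a binary mask directly with nearest-neighbour interpolation, one can warp its signed Euclidean distance field with linear interpolation and re-threshold, which tends to preserve thin dense-tissue structures.

```python
# Minimal sketch of EDT-based mask warping (assumed shapes; not the authors' implementation).
import numpy as np
from scipy.ndimage import distance_transform_edt, map_coordinates

def warp_mask_via_edt(mask, displacement):
    """mask: (D, H, W) binary array; displacement: (3, D, H, W) voxel offsets."""
    # Signed distance field: positive inside the mask, negative outside.
    sdf = distance_transform_edt(mask) - distance_transform_edt(1 - mask)
    # Sample the continuous signed field at the displaced (pull-back) coordinates.
    coords = np.indices(mask.shape).astype(np.float64) + displacement
    warped_sdf = map_coordinates(sdf, coords, order=1, mode="nearest")
    # Re-binarise: voxels whose warped signed distance stays positive lie inside the warped mask.
    return (warped_sdf > 0).astype(mask.dtype)
```

Thresholding the warped signed field, rather than the warped binary mask, avoids the stair-stepping and loss of thin structures that nearest-neighbour warping of labels typically introduces.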
Scientific discovery is driven by the iterative process of background research, hypothesis generation, experimentation, and data analysis. Despite recent advancements in applying artificial intelligence to scientific discovery, no system has yet automated all of these stages in a single workflow. Here, we introduce Robin, the first multi-agent system capable of fully automating the key intellectual steps of the scientific process. By integrating literature search agents with data analysis agents, Robin can generate hypotheses, propose experiments, interpret experimental results, and generate updated hypotheses, achieving a semi-autonomous approach to scientific discovery. By applying this system, we were able to identify a novel treatment for dry age-related macular degeneration (dAMD), the major cause of blindness in the developed world. Robin proposed enhancing retinal pigment epithelium phagocytosis as a therapeutic strategy, and identified and validated a promising therapeutic candidate, ripasudil. Ripasudil is a clinically-used rho kinase (ROCK) inhibitor that has never previously been proposed for treating dAMD. To elucidate the mechanism of ripasudil-induced upregulation of phagocytosis, Robin then proposed and analyzed a follow-up RNA-seq experiment, which revealed upregulation of ABCA1, a critical lipid efflux pump and possible novel target. All hypotheses, experimental plans, data analyses, and data figures in the main text of this report were produced by Robin. As the first AI system to autonomously discover and validate a novel therapeutic candidate within an iterative lab-in-the-loop framework, Robin establishes a new paradigm for AI-driven scientific discovery.
Recent work investigated the use of Reinforcement Learning (RL) for the synthesis of heuristic guidance to improve the performance of temporal planners when a domain is fixed and a set of training problems (not plans) is given. The idea is to extract a heuristic from the value function of a particular (possibly infinite-state) MDP constructed over the training problems. In this paper, we propose an evolution of this learning and planning framework that focuses on exploiting the information provided by symbolic heuristics during both the RL and planning phases. First, we formalize different reward schemata for the synthesis and use symbolic heuristics to mitigate the problems caused by the truncation of episodes needed to deal with the potentially infinite MDP. Second, we propose learning a residual of an existing symbolic heuristic, which is a "correction" of the heuristic value, instead of eagerly learning the whole heuristic from scratch. Finally, we use the learned heuristic in combination with a symbolic heuristic using a multiple-queue planning approach to balance systematic search with imperfect learned information. We experimentally compare all the approaches, highlighting their strengths and weaknesses and significantly advancing the state of the art for this planning and learning schema.
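Schematically, the residual formulation can be written as follows (the notation is assumed here for illustration and is not taken from the paper): the learned correction $r_\theta$ adjusts the symbolic estimate rather than replacing it,
\[
\hat{h}(s) \;=\; h_{\mathrm{sym}}(s) + r_\theta(s),
\]
so that a zero correction falls back to the symbolic heuristic and the network only needs to learn where $h_{\mathrm{sym}}$ is systematically too optimistic or too pessimistic.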
Robotic bin packing is valuable in a wide range of real-world scenarios such as e-commerce and warehousing. Yet, existing works focus mainly on the shape of objects to optimize packing compactness and neglect object properties such as fragility, edibility, and chemistry that humans typically consider when packing objects. This paper presents OPA-Pack (Object-Property-Aware Packing framework), the first framework that equips the robot with object property considerations in planning the object packing. Technically, we develop a novel object property recognition scheme with retrieval-augmented generation and chain-of-thought reasoning, and build a dataset with object property annotations for 1,032 everyday objects. Also, we formulate OPA-Net, aiming to jointly separate incompatible object pairs and reduce pressure on fragile objects, while compacting the packing. OPA-Net consists of a property embedding layer to encode the properties of candidate objects to be packed, together with a fragility heightmap and an avoidance heightmap to keep track of the packed objects. We then design a reward function and adopt a deep Q-learning scheme to train OPA-Net. Experimental results show that OPA-Pack greatly improves the accuracy of separating incompatible object pairs (from 52% to 95%) and largely reduces pressure on fragile objects (by 29.4%), while maintaining good packing compactness. Besides, we demonstrate the effectiveness of OPA-Pack on a real packing platform, showcasing its practicality in real-world scenarios.
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometries make accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code, data, and trained models will be made available.
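The two-stage pipeline can be sketched roughly as follows (vocabulary size, sequence length, and layer sizes below are assumptions for illustration, not the authors' configuration): a VQ-VAE maps a vessel tree to a sequence of discrete codebook indices, a GPT-2-style prior models that index sequence autoregressively, and sampling from the prior followed by VQ-VAE decoding yields a new tree.

```python
# Hedged sketch of a discrete-token autoregressive prior over vessel-tree codes.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

CODEBOOK_SIZE = 512   # assumed size of the learned discrete vocabulary
MAX_TOKENS = 256      # assumed maximum token length of one vascular tree

config = GPT2Config(vocab_size=CODEBOOK_SIZE + 1,  # +1 for a begin/end-of-tree token
                    n_positions=MAX_TOKENS, n_layer=6, n_head=8, n_embd=256)
prior = GPT2LMHeadModel(config)

# Sampling a new tree: decode codebook indices token by token; the VQ-VAE decoder would
# then map them back to branch geometry (including B-spline cross-sections).
start = torch.zeros((1, 1), dtype=torch.long)
tokens = prior.generate(start, max_length=MAX_TOKENS, do_sample=True, top_p=0.9)
```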
Large uncertainties in the energy transition urge decision-makers to develop low-regret strategies, i.e., strategies that perform well regardless of how the future unfolds. To address this challenge, we introduce a decision-support framework that identifies low-regret strategies in energy system planning under uncertainty. Our framework (i) automatically identifies strategies, (ii) evaluates their performance in terms of regret, (iii) assesses the key drivers of regret, and (iv) supports the decision process with intuitive decision trees, regret curves and decision maps. We apply the framework to evaluate the optimal use of biomass in the transition to net-zero energy systems, considering all major biomass utilization options: biofuels, biomethane, chemicals, hydrogen, biochar, electricity, and heat. Producing fuels and chemicals from biomass performs best across various decision-making criteria. In contrast, the current use of biomass, mainly for low-temperature heat supply, results in high regret, making it a must-avoid in the energy transition.
The advent of the internet has been both a blessing and a curse for once-marginalised communities. When used well, the internet can connect and establish communities crossing different intersections; however, it can also be used as a tool to alienate people and communities and to perpetuate hate, misinformation, and disinformation, especially on social media platforms. We propose steering hate speech research and researchers away from pre-existing computational solutions and considering social methods to inform social solutions to this social problem. In the same way that linguistics research can inform language planning policy, linguists should apply what we know about language and society to mitigate some of the emergent risks and dangers of anti-social behaviour in digital spaces. We argue that linguists and NLP researchers can play a principal role in unleashing the social impact potential of linguistics research, working alongside communities, advocates, activists, and policymakers to enable equitable digital inclusion and to close the digital divide.
Accurate detection of changes in roads and bridges, such as construction, renovation, and demolition, is essential for urban planning and traffic management. However, existing methods often struggle to extract fine-grained semantic change information due to the lack of high-quality annotated datasets in traffic scenarios. To address this, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset, a comprehensive benchmark comprising 260 pairs of high-resolution remote sensing images from diverse cities and countries. RB-SCD captures 11 types of semantic changes across varied road and bridge structures, enabling detailed structural and functional analysis. Building on this dataset, we propose a novel framework, Multimodal Frequency-Driven Change Detector (MFDCD), which integrates multimodal features in the frequency domain. MFDCD includes a Dynamic Frequency Coupler (DFC) that fuses hierarchical visual features with wavelet-based frequency components, and a Textual Frequency Filter (TFF) that transforms CLIP-derived textual features into the frequency domain and applies graph-based filtering. Experimental results on RB-SCD and three public benchmarks demonstrate the effectiveness of our approach.
Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
The Atacama Large Millimeter Array (ALMA) has revolutionised the field of solar millimetre astronomy with its high angular resolution and cadence. However, with a limited field of view (FOV), targeted observations of highly dynamic phenomena such as flares are challenging. A large-aperture single-dish telescope with a large FOV, such as the future Atacama Large Aperture Submillimeter Telescope (AtLAST), would prove useful in observing such phenomena, as one could scan the full solar disk on shorter timescales. We aimed to explore which FOVs, detector counts, and scan strategies are suitable for AtLAST to push the required full-disk scan times below 1 minute, enabling regular observations of dynamic solar phenomena. Utilising the maria code, we simulated solar observations with AtLAST and thoroughly explored how instrumental properties and scanning strategies affect the full-disk observations in the planned frequency bands. We find the double-circle scan pattern, currently employed at ALMA for full-disk mapping, to also be an acceptable way of scanning the Sun with AtLAST. Using small to intermediate-sized instruments (1,000 - 50,000 detector elements), the estimated observational cadence would be less than 1 minute across AtLAST's frequency range with a reasonable pixel spacing. Using instruments with larger FOVs ($\gtrapprox 0.25^\circ$, equivalent to $\gtrapprox$ 1 R$_\odot$), we find a simple circular scan to be more efficient, achieving cadences on second time scales, but requiring more detector elements ($\gtrapprox$ 100,000). We find that a large-FOV single-dish telescope such as AtLAST could provide the solar millimetre community with hitherto unachievable observations, namely full-disk observations at high cadence and adequate resolution. With cadences potentially down to seconds, such an instrument would be ideal for studying quickly evolving solar phenomena.
Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and quantifying uncertainty. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.
Transcranial focused ultrasound (tFUS) is an emerging modality for non-invasive brain stimulation and therapeutic intervention, offering millimeter-scale spatial precision and the ability to target deep brain structures. However, the heterogeneous and anisotropic nature of the human skull introduces significant distortions to the propagating ultrasound wavefront, which require time-consuming patient-specific planning and corrections using numerical solvers for accurate targeting. To enable data-driven approaches in this domain, we introduce TFUScapes, the first large-scale, high-resolution dataset of tFUS simulations through anatomically realistic human skulls derived from T1-weighted MRI images. We have developed a scalable simulation engine pipeline using the k-Wave pseudo-spectral solver, where each simulation returns a steady-state pressure field generated by a focused ultrasound transducer placed at realistic scalp locations. In addition to the dataset, we present DeepTFUS, a deep learning model that estimates normalized pressure fields directly from input 3D CT volumes and transducer position. The model extends a U-Net backbone with transducer-aware conditioning, incorporating Fourier-encoded position embeddings and MLP layers to create global transducer embeddings. These embeddings are fused with U-Net encoder features via feature-wise modulation, dynamic convolutions, and cross-attention mechanisms. The model is trained using a combination of spatially weighted and gradient-sensitive loss functions, enabling it to approximate high-fidelity wavefields. The TFUScapes dataset is publicly released to accelerate research at the intersection of computational acoustics, neurotechnology, and deep learning. The project page is available at https://github.com/CAMMA-public/TFUScapes.
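A rough sketch of the transducer-aware conditioning described above follows (shapes and layer sizes are assumptions, and the dynamic convolutions and cross-attention used in the full model are omitted here): the 3D transducer position is Fourier-encoded, mapped to a global embedding by an MLP, and injected into encoder features via feature-wise modulation.

```python
# Hedged sketch of Fourier-encoded transducer conditioning with FiLM-style modulation.
import torch
import torch.nn as nn

class TransducerConditioning(nn.Module):
    def __init__(self, n_freqs=8, embed_dim=128, feat_channels=64):
        super().__init__()
        # Geometric Fourier frequencies for positional encoding.
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs).float())
        in_dim = 3 * 2 * n_freqs  # sin and cos per spatial axis
        self.mlp = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))
        self.to_scale_shift = nn.Linear(embed_dim, 2 * feat_channels)

    def forward(self, features, position):
        # features: (B, C, D, H, W) encoder features; position: (B, 3) normalised transducer location.
        angles = position[:, :, None] * self.freqs                    # (B, 3, n_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        emb = self.mlp(fourier)                                       # global transducer embedding
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None, None]
        shift = shift[:, :, None, None, None]
        return features * (1 + scale) + shift                         # feature-wise modulation
```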
This article explores the use of artificial intelligence for diagnosing pathologies of the temporomandibular joint (TMJ), in particular for segmenting the articular disc on MRI images. The work is motivated by the high prevalence of TMJ pathologies and the need to improve the accuracy and speed of diagnosis in medical institutions. We analyzed existing solutions (Diagnocat, MandSeg) and found them unsuitable for studying the articular disc because they are oriented towards bone structures. To solve the problem, an original dataset of 94 images was collected with the classes "temporomandibular joint" and "jaw". Augmentation methods were used to increase the amount of data. U-Net, YOLOv8n, YOLOv11n, and Roboflow neural network models were then trained and compared. The evaluation was carried out using the Dice Score, Precision, Sensitivity, Specificity, and Mean Average Precision metrics. The results confirm the potential of the Roboflow model for segmentation of the temporomandibular joint. In the future, we plan to develop an algorithm for measuring the distance between the jaws and determining the position of the articular disc, which will improve the diagnosis of TMJ pathologies.
Legged robots have the potential to leverage obstacles to climb steep sand slopes. However, efficiently repositioning these obstacles to desired locations is challenging. Here we present DiffusiveGRAIN, a learning-based method that enables a multi-legged robot to strategically induce localized sand avalanches during locomotion and indirectly manipulate obstacles. We conducted 375 trials, systematically varying obstacle spacing, robot orientation, and leg actions in 75 of them. Results show that the movement of closely-spaced obstacles exhibits significant interference, requiring joint modeling. In addition, different multi-leg excavation actions could cause distinct robot state changes, necessitating integrated planning of manipulation and locomotion. To address these challenges, DiffusiveGRAIN includes a diffusion-based environment predictor to capture multi-obstacle movements under granular flow interferences and a robot state predictor to estimate changes in robot state from multi-leg action patterns. Deployment experiments (90 trials) demonstrate that by integrating the environment and robot state predictors, the robot can autonomously plan its movements based on loco-manipulation goals, successfully shifting closely located rocks to desired locations in over 65% of trials. Our study showcases the potential for a locomoting robot to strategically manipulate obstacles to achieve improved mobility on challenging terrains.
Early cavemen relied on gestures, vocalizations, and simple signals to coordinate, plan, avoid predators, and share resources. Today, humans collaborate using complex languages to achieve remarkable results. What drives this evolution in communication? How does language emerge, adapt, and become vital for teamwork? Understanding the origins of language remains a challenge. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi-agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high-value targets or executing temporally ordered actions. Using end-to-end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings. We will release all data, code, and models publicly.
Understanding people's preferences and needs is crucial for urban planning decisions, yet current approaches often aggregate responses from multi-cultural and multi-city populations, obscuring important demographic differences and risking amplifying biases. We conducted a large-scale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and, for the first time, personality traits -- shape perceptions among 1,000 participants, with balanced demographics, from five countries and 45 nationalities. This dataset, introduced as Street Perception Evaluation Considering Socioeconomics (SPECS), exhibits statistically significant differences in perception scores across demographics and personalities for six traditionally used indicators (safe, lively, wealthy, beautiful, boring, and depressing) and four new ones we propose (live nearby, walk, cycle, green). We revealed that location-based sentiments carry over into people's preferences when comparing urban streetscapes with those of other cities. Further, we compared perception scores based on where participants and streetscapes are from. We found that an off-the-shelf machine learning model trained on an existing global perception dataset tends to overestimate positive indicators and underestimate negative ones compared to human responses, suggesting that targeted interventions should consider locals' perceptions. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
Object-goal navigation (ON) enables autonomous robots to locate and reach user-specified objects in previously unknown environments, offering promising applications in domains such as assistive care and disaster response. Existing ON methods -- including training-free approaches, reinforcement learning, and zero-shot planners -- generally depend on active exploration to identify landmark objects (e.g., kitchens or desks), followed by navigation toward semantically related targets (e.g., a specific mug). However, these methods often lack strategic planning and do not adequately address trade-offs among multiple objectives. To overcome these challenges, we propose a novel framework that formulates ON as a multi-objective optimization problem (MOO), balancing frontier-based knowledge exploration with knowledge exploitation over previously observed landmarks; we call this framework MOON (MOO-driven ON). We implement a prototype MOON system with three key components: (1) it builds on QOM [IROS05], a classical ON system that compactly and discriminatively encodes landmarks based on their semantic relevance to the target; (2) it integrates StructNav [RSS23], a recently proposed training-free planner, to enhance the navigation pipeline; and (3) it introduces a variable-horizon set orienteering problem formulation to enable global optimization over both exploration and exploitation strategies. This work represents an important first step toward developing globally optimized, next-generation object-goal navigation systems.
We present a novel and practically significant problem, Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically-informed geo-contextual evaluation framework, the Place Similarity Score (PSS), which measures consistency between input soundscapes and generated landscape images at the element, scene, and human-perception levels. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and consistency with geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.
High-resolution wind information is essential for wind energy planning and power forecasting, particularly in regions with complex terrain. However, most AI-based weather forecasting models operate at kilometer-scale resolution, constrained by the reanalysis datasets they are trained on. Here we introduce FuXi-CFD, an AI-based downscaling framework designed to generate detailed three-dimensional wind fields at 30-meter horizontal resolution, using only coarse-resolution atmospheric inputs. The model is trained on a large-scale dataset generated via computational fluid dynamics (CFD), encompassing a wide range of terrain types, surface roughness, and inflow conditions. Remarkably, FuXi-CFD predicts full 3D wind structures -- including vertical wind and turbulent kinetic energy -- based solely on horizontal wind input at 10 meters above ground, the typical output of AI-based forecast systems. It achieves CFD-comparable accuracy while reducing inference time from hours to seconds. By bridging the resolution gap between regional forecasts and site-specific wind dynamics, FuXi-CFD offers a scalable and operationally efficient solution to support the growing demands of renewable energy deployment.
Vehicular metaverses are an emerging paradigm that merges intelligent transportation systems with virtual spaces, leveraging advanced digital twin and Artificial Intelligence (AI) technologies to seamlessly integrate vehicles, users, and digital environments. In this paradigm, vehicular AI agents are endowed with environment perception, decision-making, and action execution capabilities, enabling real-time processing and analysis of multi-modal data to provide users with customized interactive services. Since vehicular AI agents require substantial resources for real-time decision-making under vehicle mobility and dynamic network conditions, the agents are deployed in RoadSide Units (RSUs) with sufficient resources and dynamically migrated among them. However, AI agent migration requires frequent data exchanges, which may expose vehicular metaverses to potential cyber attacks. To this end, we propose a reliable vehicular AI agent migration framework, achieving reliable dynamic migration and efficient resource scheduling through cooperation between vehicles and RSUs. Additionally, we design a trust evaluation model based on the theory of planned behavior to dynamically quantify the reputation of RSUs, thereby better accommodating the personalized trust preferences of users. We then model the vehicular AI agent migration process as a partially observable Markov decision process and develop a Confidence-regulated Generative Diffusion Model (CGDM) to efficiently generate AI agent migration decisions. Numerical results demonstrate that the CGDM algorithm significantly outperforms baseline methods in reducing system latency and enhancing robustness against cyber attacks.
Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants.\footnote{We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.} Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
Verification and validation (V&V) of autonomous vehicles (AVs) typically requires exhaustive testing across a variety of operating environments and driving scenarios including rare, extreme, or hazardous situations that might be difficult or impossible to capture in reality. Additionally, physical V&V methods such as track-based evaluations or public-road testing are often constrained by time, cost, and safety, which motivates the need for virtual proving grounds. However, the fidelity and scalability of simulation-based V&V methods can quickly turn into a bottleneck. In such a milieu, this work proposes a virtual proving ground that flexibly scales digital twins within high-performance computing clusters (HPCCs) and automates the V&V process. Here, digital twins enable high-fidelity virtual representation of the AV and its operating environments, allowing extensive scenario-based testing. Meanwhile, HPCC infrastructure brings substantial advantages in terms of computational power and scalability, enabling rapid iterations of simulations, processing and storage of massive amounts of data, and deployment of large-scale test campaigns, thereby reducing the time and cost associated with the V&V process. We demonstrate the efficacy of this approach through a case study that focuses on the variability analysis of a candidate autonomy algorithm to identify potential vulnerabilities in its perception, planning, and control sub-systems. The modularity, scalability, and interoperability of the proposed framework are demonstrated by deploying a test campaign comprising 256 test cases on two different HPCC architectures to ensure continuous operation in a publicly shared resource setting. The findings highlight the ability of the proposed framework to accelerate and streamline the V&V process, thereby significantly compressing (~30x) the timeline.
This study presents a dynamic safety margin-based reinforcement learning framework for local motion planning in dynamic and uncertain environments. The proposed planner integrates real-time trajectory optimization with adaptive gap analysis, enabling effective feasibility assessment under partial observability constraints. To address safety-critical computations in unknown scenarios, an enhanced online learning mechanism is introduced, which dynamically corrects spatial trajectories by forming dynamic safety margins while maintaining control invariance. Extensive evaluations, including ablation studies and comparisons with state-of-the-art algorithms, demonstrate superior success rates and computational efficiency. The framework's effectiveness is further validated on both simulated and physical robotic platforms.
Long-duration energy storage (LDES) is a key component for fully renewable, sector-coupled energy systems based on wind and solar. While capacity expansion planning has begun to take into account interannual weather variability, it often ignores weather uncertainty and limited foresight in capacity and operational decisions. We build a stochastic capacity expansion model for fully decarbonized energy systems with LDES in Europe accounting for weather uncertainty, isolating the effect of limited foresight by comparing it to a perfect foresight benchmark. Under limited foresight, LDES acts as a hedge against extreme system states, operating defensively and exhibiting a stockpiling effect that is absent under perfect foresight. Solar PV gains in system value for its higher predictability, with up to 29\% higher capacities versus the benchmark, while onshore wind capacities are lower. We shed light on the underlying mechanisms by deriving implicit LDES bidding curves. We show that LDES bids reflect the costs and the weather-dependent probability of extreme system states conditional on the current system state. This has important implications for price formation in renewable electricity markets, as a wide and continuous range of probabilistic LDES bids alleviates concerns of extreme price disparity at high renewable shares.
This study deals with the problem of task and motion planning of autonomous systems within the context of high-level tasks. Specifically, a task comprises logical requirements (conjunctions, disjunctions, and negations) on the trajectories and final states of agents in certain regions of interest. We propose an optimal planning approach that combines offline computation and online planning. First, a simplified Petri net system is proposed to model the autonomous system. Then, indicating places are designed to implement the logical requirements of the specifications. Building upon this, a compact representation of the state space called the extended basis reachability graph is constructed, and an efficient online planning algorithm is developed to obtain the optimal plan. It is shown that the most burdensome part of the planning procedure may be moved offline, thanks to the construction of the extended basis reachability graph. Finally, a series of simulations is conducted to demonstrate the computational efficiency and scalability of our developed method.
This study investigates a stochastic production planning problem with a running cost composed of quadratic production costs and inventory-dependent costs. The objective is to minimize the expected cost until production stops when inventory reaches a specified level, subject to a boundary condition. Using probability space and Brownian motion, the Hamilton-Jacobi-Bellman (HJB) equation is derived, and optimal feedback control is obtained. The solution demonstrates desirable monotonicity and convexity properties under specific assumptions. An illustrative example further confirms these results with explicit function properties and a practical application.
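A generic form of this class of problems reads as follows (the notation is assumed for illustration and may differ from the paper's): with inventory $X_t$ driven by the production rate $p_t$ and Brownian noise, $dX_t = p_t\,dt + \sigma\,dW_t$, and running cost $p_t^{2} + h(X_t)$ accrued until the stopping time at which inventory reaches the prescribed level, the value function $V$ satisfies the HJB equation
\[
\min_{p}\Bigl\{\, p^{2} + h(x) + p\,V'(x) + \tfrac{\sigma^{2}}{2}\,V''(x) \Bigr\} = 0,
\]
with a boundary condition fixing $V$ at the stopping level; pointwise minimization over $p$ then yields the feedback control $p^{*}(x) = -\tfrac{1}{2}V'(x)$, whose monotonicity and convexity properties are inherited from those of $V$.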
We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.
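The scoring rule can be sketched in a few lines (the mixture weight and function names below are illustrative assumptions, not the paper's exact parameters): a candidate plan is evaluated by its expected cost under a mixture that places most probability on normal predictions and a small share on adversarial ones.

```python
# Hedged sketch of expected-cost plan scoring under a normal/adversarial mixture.
import numpy as np

def score_plan(plan_cost, normal_samples, adversarial_samples, adversarial_weight=0.1):
    """plan_cost: callable mapping one predicted agent future to a scalar cost for the plan."""
    normal_cost = np.mean([plan_cost(s) for s in normal_samples])
    adversarial_cost = np.mean([plan_cost(s) for s in adversarial_samples])
    # Expected cost under the mixture: mostly normal behaviour with a small adversarial share.
    return (1 - adversarial_weight) * normal_cost + adversarial_weight * adversarial_cost
```

Because low-cost normal behaviours keep their weight in the score, this mixture avoids the over-conservatism of risk measures that concentrate on adversarial outcomes or of hard safety constraints.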
Despite over a decade of development, autonomous driving trajectory planning in complex urban environments continues to encounter significant challenges. These challenges include the difficulty of accommodating the multi-modal nature of trajectories, the limitations of a single expert in managing diverse scenarios, and insufficient consideration of environmental interactions. To address these issues, this paper introduces the EMoE-Planner, which incorporates three innovative approaches. Firstly, the Explicit MoE (Mixture of Experts) dynamically selects specialized experts based on scenario-specific information through a shared scene router. Secondly, the planner utilizes scene-specific queries to provide multi-modal priors, directing the model's focus towards relevant target areas. Lastly, it enhances the prediction model and loss calculation by considering the interactions between the ego vehicle and other agents, thereby significantly boosting planning performance. Comparative experiments were conducted on the nuPlan dataset against state-of-the-art methods. The simulation results demonstrate that our model consistently outperforms SOTA models across nearly all test scenarios.
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address these problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q\&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS, achieving state-of-the-art performance across all three benchmarks and showing strong generalization to out-of-domain scenarios.
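For reference, a generic DPO objective applied to such thinking-level preference pairs looks like the sketch below (a standard DPO form with assumed argument names; the paper's T-DPO formulation may differ in detail). Preferred responses would correspond to CoaT branches whose leaves received higher rule-based rewards.

```python
# Hedged sketch of a standard DPO loss over (chosen, rejected) preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument is the summed log-probability of one response under the trained policy
    or the frozen reference model; 'chosen' responses come from higher-reward branches."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Encourage the policy to widen its margin on chosen responses relative to the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```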
Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decision-making tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death'' for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems.