Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation and unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at https://frommarginaltojointpred.github.io/.
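To make the "post-processing of the marginal predictions" route concrete, here is a minimal sketch (our own construction, not one of the paper's models): per-agent marginal modes are combined into scene-level joint predictions by scoring each combination with the product of mode probabilities and discarding combinations that bring two agents closer than a collision threshold. The enumeration is exponential in the number of agents, which is precisely the scalability issue that trained joint models aim to avoid.

```python
import itertools
import numpy as np

def marginal_to_joint(modes, probs, min_dist=2.0, top_k=6):
    """Combine per-agent marginal modes into scene-level joint predictions.

    modes: list over agents of (K, T, 2) trajectory arrays
    probs: list over agents of (K,) mode probabilities
    Combinations where any two agents come closer than `min_dist`
    at the same timestep are discarded as physically inconsistent.
    Note: enumeration is exponential in the number of agents.
    """
    joint = []
    for combo in itertools.product(*[range(len(p)) for p in probs]):
        trajs = [m[k] for m, k in zip(modes, combo)]
        score = float(np.prod([p[k] for p, k in zip(probs, combo)]))
        consistent = all(
            np.min(np.linalg.norm(a - b, axis=-1)) >= min_dist
            for a, b in itertools.combinations(trajs, 2)
        )
        if consistent:
            joint.append((score, trajs))
    joint.sort(key=lambda s: -s[0])       # keep the most likely consistent scenes
    return joint[:top_k]
```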
Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.
Purpose: To develop clinically relevant test cases for commissioning Model-Based Dose Calculation Algorithms (MBDCAs) for 192Ir High Dose Rate (HDR) gynecologic brachytherapy, following the workflow proposed by the TG-186 report and the WGDCAB report 372. Acquisition and Validation Methods: Two cervical cancer intracavitary HDR brachytherapy patient models were created, using either uniformly structured regions or realistic segmentation. The computed tomography (CT) images of the models were converted to DICOM CT images via MATLAB and imported into two Treatment Planning Systems (TPSs) with MBDCA capability. The clinical segmentation was expanded to include additional organs at risk. The actual clinical treatment plan was generally maintained, with the source replaced by a generic 192Ir HDR source. Dose-to-medium-in-medium calculations were performed using the MBDCA option of each TPS, as well as three different Monte Carlo (MC) simulation codes. MC results agreed within statistical uncertainty, while comparisons between MBDCA and MC dose distributions highlighted both strengths and limitations of the studied MBDCAs, suggesting potential approaches to overcoming the challenges. Data Format and Usage Notes: The datasets for the developed cases are available online at http://doi.org/10.5281/zenodo.15720996. The DICOM files include the treatment plan for each case and TPS, along with the corresponding reference MC dose data. The package also contains a TPS- and case-specific user guide for commissioning the MBDCAs, and the files needed to replicate the MC simulations. Potential Applications: The provided datasets and proposed methodology offer a commissioning framework for TPSs using MBDCAs, and serve as a benchmark for brachytherapy researchers using MC methods. They also facilitate intercomparisons of MBDCA performance and provide a quality assurance resource for evaluating future TPS software updates.
Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa - Look, Explain, Replan - a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection - without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look, where LERa generates a scene description and identifies errors; (ii) Explain, where it provides corrective guidance; and (iii) Replan, where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real-world trials with a tabletop manipulator robot, confirm LERa's effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error-aware task execution in robotics. The code is available at https://lera-robo.github.io.
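A minimal sketch of the three-step loop described above, assuming a generic `vlm(image, prompt)` chat interface; the function name and prompt wording are illustrative stand-ins, not LERa's actual prompts:

```python
def lera_replan(vlm, image, instruction, plan):
    """Three-stage VLM replanning: Look -> Explain -> Replan (schematic)."""
    # Look: describe the scene and identify what went wrong
    scene = vlm(image, f"Describe the scene. Task: {instruction}. "
                       f"Current plan: {plan}. What error do you see?")
    # Explain: ask for corrective guidance based on the observed error
    advice = vlm(image, f"Given this error analysis: {scene}. "
                        f"How should the plan be corrected?")
    # Replan: produce a revised step-by-step plan
    return vlm(image, f"Rewrite the plan as numbered steps, applying: {advice}")
```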
This paper presents a novel corrective frequency-constrained unit commitment (FCUC) formulation for island power systems by implementing data-driven constraint learning to estimate the optimal under-frequency load shedding (UFLS). A Tobit model is used to estimate the optimal amount of UFLS from the initial rate of change of frequency. The proposed formulation enables co-optimizing operation costs and UFLS. The aim is to account for optimal UFLS occurrences during operation planning without increasing them, which would potentially reduce system operation costs by relaxing the reserve requirement constraint. The performance of the proposed formulation is analyzed for a Spanish island power system through various simulations. Different daily demand profiles are analyzed to demonstrate the effectiveness of the proposed formulation, and a sensitivity analysis is conducted to demonstrate the effects of changing the cost associated with UFLS. The corrective FCUC is shown to be capable of reducing system operation costs without jeopardizing the quality of the frequency response in terms of UFLS occurrence.
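For reference, the standard (type-I) Tobit model censored at zero, written here for UFLS estimation with the initial rate of change of frequency as the regressor (notation ours, not the paper's):

```latex
\[
y^{*} \;=\; \beta_{0} + \beta_{1}\,\mathrm{RoCoF}_{0} + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0,\sigma^{2}),
\qquad
y_{\mathrm{UFLS}} \;=\; \max\!\left(0,\; y^{*}\right),
\]
```

so the estimate is exactly zero for contingencies that shed no load, matching the censored nature of UFLS observations.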
Over nearly two decades, Differential Dynamic Microscopy (DDM) has become a standard technique for extracting dynamic correlation functions from time-lapse microscopy data, with applications spanning colloidal suspensions, polymer solutions, active fluids, and biological systems. In its most common implementation, DDM analyzes image sequences acquired with a conventional microscope equipped with a digital camera, yielding time- and wavevector-resolved information analogous to that obtained in multi-angle Dynamic Light Scattering (DLS). With a widening array of applications and a growing, heterogeneous user base, lowering the technical barrier to performing DDM has become a central objective. In this tutorial article, we provide a step-by-step guide to conducting DDM experiments -- from planning and acquisition to data analysis -- and introduce the open-source software package fastDDM, designed to efficiently process large image datasets. fastDDM employs optimized, parallel algorithms that reduce analysis times by up to four orders of magnitude on typical datasets (e.g., 10,000 frames), thereby enabling high-throughput workflows and making DDM more broadly accessible across disciplines.
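As a rough illustration of the computation fastDDM accelerates, the image structure function D(q, dt), i.e., the time-averaged power spectrum of image differences, can be estimated in a few lines of NumPy; the function names and the radial binning below are our own simplifications, not the fastDDM API:

```python
import numpy as np

def image_structure_function(frames, lag):
    """Estimate D(q, dt) for one lag time from a (T, H, W) image stack."""
    diffs = frames[lag:] - frames[:-lag]        # intensity differences I(t+dt) - I(t)
    fft2 = np.fft.fft2(diffs, axes=(1, 2))      # 2D spatial Fourier transform
    power = np.abs(fft2) ** 2                   # |FFT|^2 per difference image
    return power.mean(axis=0)                   # average over start times t

def radial_average(d_map, n_bins=50):
    """Azimuthally average a 2D map over wavevector magnitude |q|."""
    h, w = d_map.shape
    qy = np.fft.fftfreq(h)[:, None]
    qx = np.fft.fftfreq(w)[None, :]
    q = np.hypot(qx, qy)
    bins = np.linspace(0, q.max(), n_bins)
    idx = np.digitize(q.ravel(), bins)
    out = []
    for i in range(1, len(bins)):
        vals = d_map.ravel()[idx == i]
        out.append(vals.mean() if vals.size else np.nan)
    return np.array(out)
```

Repeating `image_structure_function` over all lag times is the quadratic-cost loop that fastDDM's optimized algorithms are designed to speed up.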
Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.
Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next-frame prediction mAP, with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world-model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world-model RL dropped to 3.1% mAP at 10 s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision-making and provides crucial insights for surgical AI development.
Many robotic tasks, such as inverse kinematics, motion planning, and optimal control, can be formulated as optimization problems. Solving these problems involves addressing nonlinear kinematics, complex contact dynamics, and long-horizon planning, each posing distinct challenges for state-of-the-art optimization methods. To efficiently solve a wide range of tasks across varying scenarios, researchers either develop algorithms specialized for the task at hand or switch between different frameworks. Monte Carlo Tree Search (MCTS) is a general-purpose decision-making tool that enables strategic exploration across problem instances without relying on task-specific structures. However, MCTS suffers from combinatorial complexity, leading to slow convergence and high memory usage. To address this limitation, we propose \emph{Tensor Train Tree Search} (TTTS), which leverages tensor factorization to exploit the separable structure of decision trees. This yields a low-rank, linear-complexity representation that significantly reduces both computation time and storage requirements. We prove that TTTS can efficiently reach the bounded global optimum within finite time. Experimental results across inverse kinematics, motion planning around obstacles, multi-stage motion planning, and bimanual whole-body manipulation demonstrate the efficiency of TTTS on a diverse set of robotic tasks.
Since the European Union introduced its Circular Economy (CE) Action Plan in 2015, CE research has expanded rapidly. However, the structure of this emerging field - both in terms of its constituent disciplines and researcher dynamics - remains poorly understood. To address this gap, we analyze over 25,000 CE-related publications from Scopus by combining conventional bibliometric approaches with advanced machine learning techniques, including text embeddings and clustering. This hybrid method enables both a macro-level mapping of research domains and a micro-level investigation of individual researchers' disciplinary backgrounds and collaborations. We classify CE research into 16 distinct clusters, identifying the original disciplines of researchers and visualizing patterns of interdisciplinary collaboration. Building on this foundation, we ask: Which CE-related research domains receive the most attention in academic and policy contexts? And how are different types of interdisciplinary collaboration associated with research impact? Our findings show that research in business and management attracts substantial academic and policy attention, while engineering research - though less visible - tends to achieve higher funding success. This suggests a positive dynamic in which the former draws attention to CE issues and the latter secures the economic resources necessary to realize them. We further demonstrate that CE papers co-authored by researchers from different disciplines tend to show higher research impact than intradisciplinary work. Qualitative case analyses also highlight this tendency. Centered particularly on collaborations between business-oriented and engineering-oriented disciplines, our findings underscore the importance of interdisciplinary efforts in CE research and offer insights for guiding future cross-disciplinary engagement in the field.
Reasoning about the trajectories of multiple, interacting objects is integral to physical reasoning tasks in machine learning. This involves conditions imposed on the objects at different time steps, for instance initial states or desired goal states. Existing approaches in physical reasoning generally rely on autoregressive modeling, which can only be conditioned on initial states, but not on later states. In fields such as planning for reinforcement learning, similar challenges are being addressed with denoising diffusion models. In this work, we propose an object-centric denoising diffusion model architecture for physical reasoning that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. We demonstrate how this model can solve tasks with multiple conditions and examine its performance when changing object numbers and trajectory lengths during inference.
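One generic mechanism for conditioning a trajectory diffusion model on arbitrary (object, time step) pairs is inpainting-style replacement during sampling, re-imposing the known states after every denoising step. This is a common device in diffusion-based planning and is shown here only as a sketch; `denoiser.step` is a hypothetical API, and the paper's architecture may condition differently:

```python
import torch

def conditioned_sample(denoiser, cond_mask, cond_values, shape, steps):
    """Ancestral sampling where known (object, time) entries are clamped.

    cond_mask:   bool tensor (objects, time, dim), True where a condition applies
    cond_values: tensor holding the conditioned states at those entries
    denoiser:    hypothetical reverse-diffusion model with a .step(x, t) method
    """
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = denoiser.step(x, t)                      # one reverse-diffusion step
        x = torch.where(cond_mask, cond_values, x)   # re-impose constraints each step
    return x
```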
Identifying regions affected by disasters is a vital step in effectively managing and planning relief and rescue efforts. Unlike the traditional approach of manually assessing post-disaster damage, analyzing imagery from Unmanned Aerial Vehicles (UAVs) offers an objective and reliable way to assess the damage. In the past, segmentation techniques have been adopted to identify post-flood damage in UAV aerial images. However, most of these supervised learning approaches rely on manually annotated datasets, and annotating images is a time-consuming and error-prone task that requires domain expertise. This work leverages self-supervised features to accurately identify flooded regions in UAV aerial images. We propose two encoder-decoder-based segmentation approaches that integrate the visual features learned by DINOv2 with a traditional encoder backbone, and we investigate how well self-supervised features generalize to UAV aerial images. Specifically, we evaluate the effectiveness of features from the DINOv2 model, trained on non-aerial images, for segmenting aerial images, noting the distinct perspectives between the two image types. Our results demonstrate that DINOv2's self-supervised pretraining on natural images yields transferable, general-purpose visual features that streamline the development of aerial segmentation workflows. By leveraging these features as a foundation, we significantly reduce reliance on labor-intensive manual annotation, enabling high-accuracy segmentation with limited labeled aerial data.
This study presents a comparative analysis of Systematic Investment Plan (SIP) performance in India's Nifty 50 index over a 22-year period (2003--2024), focusing on the timing of investments -- specifically, the first trading day (FTD) of each month versus the monthly expiry day of Futures and Options (EXP). The research establishes, through mathematical validation, that SIP returns are independent of the invested amount. Leveraging historical index data, the study finds that SIPs timed on F\&O expiry days consistently outperform FTD-SIPs by 0.5--2.5\% annually over short- to medium-term horizons (1--3 years), highlighting a tactical advantage due to expiry-related market volatility. However, this advantage diminishes over longer durations (10--20 years), reaffirming that sustained, long-term investing is the key driver of wealth accumulation. Additionally, the analysis challenges prevailing industry narratives by revealing that the 20-year compounded annual growth rate (CAGR) of Nifty 50 SIPs stands at approximately 6.7\% (pre-tax), significantly lower than the widely advertised 12--15\% returns. This research offers novel insights for retail investors and financial planners on the role of investment timing in SIP strategies within the Indian equity market.
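The amount-independence claim follows in one line (our notation): if a fixed amount \(A\) is invested each month at unit price \(p_i\) over \(n\) months, the investor holds \(\sum_i A/p_i\) units, so at terminal price \(p_T\)

```latex
\[
\text{return multiple} \;=\; \frac{p_T \sum_{i=1}^{n} A/p_i}{nA}
\;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{p_T}{p_i},
\]
```

which is independent of \(A\). The same cancellation applies to CAGR and XIRR, since scaling all cash flows by a constant leaves internal rates of return unchanged.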
This paper presents a degradation-cost-aware optimization framework for multi-string battery energy storage systems (BESS), emphasizing the impact of inhomogeneous subsystem-level aging on operational decision-making. We evaluate four scenarios for an energy arbitrage application that vary in model precision and in the treatment of aging costs. Key performance metrics include operational revenue, power schedule mismatch, missed revenues, capacity losses, and revenue generated per unit of capacity loss. Our analysis reveals that ignoring the heterogeneity of subunits may lead to infeasible dispatch plans and reduced revenues. In contrast, combining an accurate representation of degraded subsystems with the consideration of aging costs in the objective function improves the operational accuracy and economic efficiency of BESS with heterogeneously aged subunits. The fully informed scenario, which combines aging-cost-aware optimization with precise string-level modeling, achieves 21% higher revenue per unit of SOH loss compared to the baseline scenario. These findings highlight that modeling aging heterogeneity is not just a technical refinement but may become a crucial enabler for maximizing both short-term profitability and long-term asset value, particularly for long BESS usage scenarios.
Motion planning is a crucial component of autonomous driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging, which captures agent behaviors and interactions and extracts diverse information from the source domains, followed by merging, which constructs an adaptable model that efficiently transfers the diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
We evaluate the global asymmetry in the luminosities of neutrinos emitted from rapidly rotating proto-neutron stars (PNSs). We build axisymmetric models of PNSs in mechanical equilibrium with rotation by adding prescribed angular momentum distributions by hand to non-rotational PNS models, which are extracted from a one-dimensional (spherically symmetric) PNS cooling calculation at different times: \(t=2, 6, 10, 20, 30\)s after a supernova explosion. We then conduct two-dimensional (spatially axisymmetric) neutrino transport calculations on top of them with the matter profiles (and the spacetime geometry) fixed. We find for the rapidly rotating models with \(T/|W|\sim 5\times 10^{-2}\) that the neutrino luminosity changes by \(\sim 3 \% \) depending on the observer position. We give detailed analyses of the neutrino hemispheres as well as the neutrino luminosities that are defined observer-wise. We also calculate the low-frequency (\(\lesssim 1{\rm Hz}\)) gravitational waves produced by the neutrinos radiated asymmetrically. We find that those gravitational waves, if emitted from the Galactic center, can be detected by planned detectors such as B-DECIGO, DECIGO and AILA. Finally, we look for crossings in the energy-integrated angular distributions in momentum space for the electron neutrino sector, a signature of the fast flavor conversion. We find them near the PNS surface in all models.
We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning and multi-modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multi-modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi-modal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real-world on-road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a comparable navigation distance to existing global navigation methods.
Biplanar X-ray imaging is widely used in health screening, postoperative rehabilitation evaluation of orthopedic diseases, and injury surgery due to its rapid acquisition, low radiation dose, and straightforward setup. However, 3D volume reconstruction from only two orthogonal projections represents a profoundly ill-posed inverse problem, owing to the intrinsic lack of depth information and irreducible ambiguities in soft-tissue visualization. While some existing methods can reconstruct skeletal structures and Computed Tomography (CT) volumes, they often yield incomplete bone geometry, imprecise tissue boundaries, and a lack of anatomical realism, thereby limiting their clinical utility in scenarios such as surgical planning and postoperative assessment. In this study, we introduce SPIDER, a novel supervised framework designed to reconstruct CT volumes from biplanar X-ray images. SPIDER incorporates tissue structure as a prior (e.g., anatomical segmentation) into an implicit neural representation decoder in the form of joint supervision through a unified encoder-decoder architecture. This design enables the model to jointly learn image intensities and anatomical structures in a pixel-aligned fashion. To address the challenges posed by sparse input and structural ambiguity, SPIDER directly embeds anatomical constraints into the reconstruction process, thereby enhancing structural continuity and reducing soft-tissue artifacts. We conduct comprehensive experiments on clinical head CT datasets and show that SPIDER generates anatomically accurate reconstructions from only two projections. Furthermore, our approach demonstrates strong potential in downstream segmentation tasks, underscoring its utility in personalized treatment planning and image-guided surgical navigation.
Stormwater infrastructures are decentralized urban water-management systems that face highly unsteady hydraulic and pollutant loadings from episodic rainfall-runoff events. Accurately evaluating their in-situ treatment performance is essential for cost-effective design and planning. Traditional lumped dynamic models (e.g., continuously stirred tank reactor, CSTR) are computationally efficient but oversimplify transport and reaction processes, limiting predictive accuracy and insight. Computational fluid dynamics (CFD) resolves detailed turbulent transport and pollutant fate physics but incurs prohibitive computational cost for unsteady and long-term simulations. To address these limitations, this study develops a composite operator-based neural network (CPNN) framework that leverages state-of-the-art operator learning to predict the spatial and temporal dynamics of hydraulics and particulate matter (PM) in stormwater treatment. The framework is demonstrated on a hydrodynamic separator (HS), a common urban treatment device. Results indicate that the CPNN achieves R2 > 0.8 for hydraulic predictions in 95.2% of test cases; for PM concentration predictions, R2 > 0.8 in 72.6% of cases and 0.4 < R2 < 0.8 in 22.6%. The analysis identifies challenges in capturing dynamics under extreme low-flow conditions, owing to their lower contribution to the training loss. Exploiting the automatic-differentiation capability of the CPNN, sensitivity analyses quantify the influence of storm event loading on PM transport. Finally, the potential of the CPNN framework for continuous, long-term evaluation of stormwater infrastructure performance is discussed, marking a step toward robust, climate-aware planning and implementation.
Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, which were evaluated using metrics such as the macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, the dataset, and the top-performing methods.
Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land management, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to model the noise-to-image diffusion process as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.
We present Photon Splatting, a physics-guided neural surrogate model for real-time wireless channel prediction in complex environments. The proposed framework introduces surface-attached virtual sources, referred to as photons, which carry directional wave signatures informed by the scene geometry and transmitter configuration. At runtime, channel impulse responses (CIRs) are predicted by splatting these photons onto the angular domain of the receiver using a geodesic rasterizer. The model is trained to learn a physically grounded representation that maps transmitter-receiver configurations to full channel responses. Once trained, it generalizes to new transmitter positions, antenna beam patterns, and mobile receivers without requiring model retraining. We demonstrate the effectiveness of the framework through a series of experiments, from canonical 3D scenes to a complex indoor cafe with 1,000 receivers. Results show 30 millisecond-level inference latency and accurate CIR predictions across a wide range of configurations. The approach supports real-time adaptability and interpretability, making it a promising candidate for wireless digital twin platforms and future 6G network planning.
We study a sequential decision-making problem motivated by recent regulatory and technological shifts that limit access to individual user data in recommender systems (RSs), leaving only population-level preference information. This privacy-aware setting poses fundamental challenges in planning under uncertainty: Effective personalization requires exploration to infer user preferences, yet unsatisfactory recommendations risk immediate user churn. To address this, we introduce the Rec-APC model, in which an anonymous user is drawn from a known prior over latent user types (e.g., personas or clusters), and the decision-maker sequentially selects items to recommend. Feedback is binary -- positive responses refine the posterior via Bayesian updates, while negative responses result in the termination of the session. We prove that optimal policies converge to pure exploitation in finite time and propose a branch-and-bound algorithm to efficiently compute them. Experiments on synthetic and MovieLens data confirm rapid convergence and demonstrate that our method outperforms the POMDP solver SARSOP, particularly when the number of user types is large or comparable to the number of content categories. Our results highlight the applicability of this approach and inspire new ways to improve decision-making under the constraints imposed by aggregated preference data.
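The Bayesian bookkeeping in Rec-APC can be sketched in a few lines (our notation; the branch-and-bound planner itself is not shown). With a prior over latent user types and a known acceptance probability per (type, item) pair, a positive response triggers a posterior update, while a negative response ends the session:

```python
import numpy as np

def update_posterior(prior, like_prob, item):
    """Posterior over user types after a positive response to `item`.

    prior:     (n_types,) probability vector over latent user types
    like_prob: (n_types, n_items) P(positive | type, item)
    """
    post = prior * like_prob[:, item]     # Bayes numerator
    return post / post.sum()              # normalize

def run_session(prior, like_prob, respond):
    """Illustrative greedy session loop (not the paper's optimal policy)."""
    belief = prior.copy()
    while True:
        item = int(np.argmax(belief @ like_prob))   # maximize expected acceptance
        if not respond(item):                       # negative feedback: user churns
            return belief
        belief = update_posterior(belief, like_prob, item)
```

The paper's result that optimal policies converge to pure exploitation in finite time concerns the exact POMDP-style planner, for which this greedy loop is only a stand-in.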
Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
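A schematic of the inline reasoning loop the abstract describes, with every callable a hypothetical stand-in rather than CoT-Diff's actual API:

```python
def cot_diff_sample(diffusion, mllm, render_conditions, prompt, steps):
    """Schematic single-round sampling loop with inline 3D-layout updates.

    diffusion, mllm, and render_conditions are hypothetical interfaces:
    the MLLM critiques intermediate predictions and revises the 3D layout,
    which is re-rendered into semantic and depth conditions at every step.
    """
    layout = mllm.plan_layout(prompt)                      # initial 3D scene layout
    x = diffusion.init_noise()
    for t in reversed(range(steps)):
        x0_pred = diffusion.predict_clean(x, t)            # intermediate prediction
        layout = mllm.revise_layout(prompt, x0_pred, layout)
        cond = render_conditions(layout)                   # semantic maps + depth
        x = diffusion.denoise_step(x, t, cond)             # condition-aware step
    return x
```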
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
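The block-wise structured attention can be illustrated with a plain block-diagonal attention mask that forbids attention across the dynamic, spatial, and semantic token groups; this is a generic reconstruction of the masking idea, not the released implementation:

```python
import torch

def blockwise_mask(lengths):
    """Boolean attention mask: True = attention allowed.

    lengths: sizes of the dynamic, spatial, and semantic token blocks.
    Tokens attend only within their own block, so the three
    representations stay clean and disentangled.
    """
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# example: 16 dynamic, 16 spatial, 32 semantic tokens
m = blockwise_mask([16, 16, 32])
```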
This paper introduces a Nonlinear Model Predictive Control (NMPC) framework for communication-aware motion planning of Multi-Rotor Aerial Vehicles (MRAVs) using Free-Space Optical (FSO) links. The scenario involves MRAVs equipped with body-fixed optical transmitters and Unmanned Ground Vehicles (UGVs) acting as mobile relays, each outfitted with fixed conical Field-of-View (FoV) receivers. The controller integrates optical connectivity constraints into the NMPC formulation to ensure beam alignment and minimum link quality, while also enabling UGV tracking and obstacle avoidance. The method supports both coplanar and tilted MRAV configurations. MATLAB simulations demonstrate its feasibility and effectiveness.
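For concreteness, the conical FoV alignment condition such a controller typically enforces can be written as follows (our notation, not necessarily the paper's exact formulation): with receiver position \(p_{\mathrm{rx}}\), boresight unit vector \(\hat{r}\), half-angle \(\alpha\), and transmitter position \(p_{\mathrm{tx}}\),

```latex
\[
\hat{r}^{\top}\,
\frac{p_{\mathrm{tx}} - p_{\mathrm{rx}}}
     {\lVert p_{\mathrm{tx}} - p_{\mathrm{rx}} \rVert}
\;\ge\; \cos\alpha,
\]
```

imposed at every prediction step of the NMPC horizon to keep the optical beam inside the receiver's cone.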
Unmanned Aerial Vehicles, operating in environments with relatively few obstacles, offer high maneuverability and full three-dimensional mobility. This allows them to rapidly approach objects and perform a wide range of tasks often challenging for ground robots, making them ideal for exploration, inspection, aerial imaging, and everyday assistance. In this paper, we introduce AirStar, a UAV-centric embodied platform that turns a UAV into an intelligent aerial assistant: a large language model acts as the cognitive core for environmental understanding, contextual reasoning, and task planning. AirStar accepts natural interaction through voice commands and gestures, removing the need for a remote controller and significantly broadening its user base. It combines geospatial knowledge-driven long-distance navigation with contextual reasoning for fine-grained short-range control, resulting in an efficient and accurate vision-and-language navigation (VLN) capability. Furthermore, the system also offers built-in capabilities such as cross-modal question answering, intelligent filming, and target tracking. With a highly extensible framework, it supports seamless integration of new functionalities, paving the way toward a general-purpose, instruction-driven intelligent UAV agent. The supplementary PPT is available at \href{https://buaa-colalab.github.io/airstar.github.io}{https://buaa-colalab.github.io/airstar.github.io}.
We introduce NourID+, a digital energy identity framework that addresses Morocco's need for trusted energy subsidy allocation through authenticated digital identity integration. NourID+ creates a strong foundation for future subsidy programs by unifying three government-issued, digitized credentials: Moroccan national identity cards (CIN), cadastral plans, and property ownership certificates. These are transformed into unique digital energy IDs (DE-IDs) that map authenticated identities to specific properties and their energy consumption patterns. The system supports three property ownership profiles: farmers (landowners), entrepreneurs (factory or company owners), and households (house owners), as energy consumption is directly related to land ownership. NourID+ provides dual access through a government portal that allows officials to process DE-ID generation requests, as well as a citizen portal for DE-ID usage and energy monitoring. Our framework supports CIN upload with facial biometric matching, automated property retrieval through government APIs, and a government officer approval workflow for DE-ID generation. In our evaluation of the system, we demonstrate a reduction in verification time from weeks to minutes, with 98% document-validation accuracy. The proposed solution allows for targeted subsidy allocation of electricity based on actual consumption needs rather than estimations, potentially improving the efficiency of Morocco's significant energy subsidy expenditure.
This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.